Module zillion.datasource

AdHocDataTable

Bases: tlbx.logging_utils.PrintMixin

class zillion.datasource.AdHocDataTable(name, data, table_type, columns=None, primary_key=None, parent=None, if_exists='fail', drop_dupes=False, convert_types=None, fillna_value='', use_full_column_names=False, schema=None, **kwargs)

Create an adhoc table from raw data that can be added to a DataSource

Parameters:

  • name - (str) The name of the table
  • data - (iterable, DataFrame, or str) The data to create the adhoc table from
  • table_type - (str) Specify the TableType
  • columns - (dict, optional) Column configuration for the table
  • primary_key - (list of str, optional) A list of fields that make up the primary key of the table
  • parent - (str, optional) A reference to a parent table in the same datasource
  • if_exists - (str, optional) Control behavior when datatables already exist in the database
  • drop_dupes - (bool, optional) Drop duplicate primary key rows when loading the table
  • convert_types - (dict, optional) A mapping of column names to types to convert to when loading the table. The types must be strings representing valid sqlalchemy types. Ex: {"col1": "date", "col2": "integer"}
  • fillna_value - (str or int, optional) Fill null values in primary key columns with this value before writing to a SQL database.
  • schema - (str, optional) The schema in which the table resides
  • kwargs - Keyword arguments passed to pandas.DataFrame.from_records if a DataFrame is created from iterable data
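
For illustration, here is a minimal sketch of building an AdHocDataTable from an iterable of records. The table name, data, and field values are hypothetical.

    from zillion.datasource import AdHocDataTable

    # Hypothetical records; any iterable accepted by
    # pandas.DataFrame.from_records would also work
    rows = [
        dict(partner="acme", sales=100),
        dict(partner="zeta", sales=250),
    ]

    dt = AdHocDataTable(
        "partner_sales",        # name
        rows,                   # data
        "metric",               # table_type
        primary_key=["partner"],
        if_exists="replace",    # overwrite an existing datatable
    )
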
get_dataframe(self)

Get the DataFrame representation of the data

table_exists(self, engine)

Determine if this table exists

to_sql(self, engine, method='multi', chunksize=1000)

Use pandas to push the adhoc table data to a SQL database.

Parameters:

  • engine - (SQLAlchemy connection engine) The engine used to connect to the database
  • method - (str, optional) Passed through to pandas
  • chunksize - (int, optional) Passed through to pandas
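
A brief usage sketch, reusing the dt table from the construction example above and assuming a local SQLite database as the target. The method and chunksize values shown are simply the documented defaults.

    from sqlalchemy import create_engine

    engine = create_engine("sqlite:///adhoc.db")  # hypothetical target database

    if not dt.table_exists(engine):
        # method and chunksize are passed through to pandas' to_sql
        dt.to_sql(engine, method="multi", chunksize=1000)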

CSVDataTable

Bases: zillion.datasource.AdHocDataTable

class zillion.datasource.CSVDataTable(name, data, table_type, columns=None, primary_key=None, parent=None, if_exists='fail', drop_dupes=False, convert_types=None, fillna_value='', use_full_column_names=False, schema=None, **kwargs)

AdHocDataTable from a CSV file using pandas.read_csv
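
A short sketch, assuming a hypothetical local CSV file:

    from zillion.datasource import CSVDataTable

    dt = CSVDataTable(
        "sales",                  # name
        "/path/to/sales.csv",     # data: the CSV file path
        "metric",                 # table_type
        primary_key=["id"],
    )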

get_dataframe(self)

table_exists(self, engine)

Determine if this table exists

to_sql(self, engine, method='multi', chunksize=1000)

Use pandas to push the adhoc table data to a SQL database.

Parameters:

  • engine - (SQLAlchemy connection engine) The engine used to connect to the database
  • method - (str, optional) Passed through to pandas
  • chunksize - (int, optional) Passed through to pandas

DataSource

Bases: zillion.field.FieldManagerMixin, tlbx.logging_utils.PrintMixin

class zillion.datasource.DataSource(name, metadata=None, config=None, nlp=False)

A component of a warehouse that houses one or more related tables

Parameters:

  • name - (str) The name of the datasource
  • metadata - (SQLAlchemy metadata, optional) A SQLAlchemy metadata object that may have zillion configuration information defined in the table and column info.zillion attribute
  • config - (dict, str, or buffer, optional) A dict adhering to the DataSourceConfigSchema or a file location to load the config from
  • nlp - (bool, optional) If true, allow NLP analysis when creating fields
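
As a rough sketch, a DataSource might be created from a config dict like the following. The connect string and table options here are illustrative assumptions about the DataSourceConfigSchema format, not a definitive spec.

    from zillion.datasource import DataSource

    config = {
        # A connection string; a DataSourceConnectSchema dict is also
        # supported (see metadata_from_connect below)
        "connect": "sqlite:///warehouse.db",
        # The table config keys below are assumptions for illustration
        "tables": {
            "main.sales": {
                "type": "metric",
                "primary_key": ["id"],
            }
        },
    }

    ds = DataSource("my_ds", config=config)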
add_dimension(self, dimension, force=False)

Add a reference to a dimension to this FieldManager

add_metric(self, metric, force=False)

Add a reference to a metric to this FieldManager

apply_config(self, config, reflect=False, nlp=False)

Apply a datasource config to this datasource's metadata. This will also ensure zillion info is present on the metadata, populate global fields, and rebuild the datasource graph.

Parameters:

  • config - (dict) The datasource config to apply
  • reflect - (bool, optional) If true, use SQLAlchemy to reflect the database. Table-level reflection will also occur if any tables are created from data URLs
  • nlp - (bool, optional) If true, allow NLP analysis when creating fields
directly_has_dimension(self, name)

Check if this FieldManager directly stores this dimension

directly_has_field(self, name)

Check if this FieldManager directly stores this field

directly_has_metric(self, name)

Check if this FieldManager directly stores this metric

find_descendent_tables(self, table)

Find graph descendants of the table

find_neighbor_tables(self, table)

Find tables that can be joined to or are parents of the given table

Parameters:

  • table - (SQLAlchemy Table) The table to find neighbors for

Returns:

(list) - A list of NeighborTables

find_possible_table_sets(self, ds_tables_with_field, field, grain, dimension_grain)

Find table sets that meet the grain

Parameters:

  • ds_tables_with_field - (list of tables) A list of datasource tables that have the target field
  • field - (str) The target field we are trying to cover
  • grain - (iterable) The grain the table set must support
  • dimension_grain - The subset of the grain consisting of requested dimensions

Returns:

(list) - A list of TableSets

from_data_file(file, primary_key, ds_name=None, schema='main', table_name=None, table_type='metric', nlp=False, **kwargs)

Create a DataSource from a data file path/url that represents a single table

Parameters:

  • file - (str) A file path/url pointing to a data file. If remote, the file will be downloaded to the adhoc data directory.
  • primary_key - (list of str) A list of fields representing the primary key of the table
  • ds_name - (str, optional) The name to give the datasource
  • schema - (str, optional) The schema to create the table in
  • table_name - (str, optional) The name to give the table. If blank, it will be derived from the file name.
  • table_type - (str, optional) Specifies the TableType
  • kwargs - (dict, optional) Additional keyword arguments to pass to the AdHocDataTable
  • nlp - (bool, optional) If true, allow NLP analysis when creating fields

Returns:

(DataSource) - A DataSource created from the file data and config
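
A usage sketch with a hypothetical remote CSV (downloaded to the adhoc data directory, per the description above):

    from zillion.datasource import DataSource

    ds = DataSource.from_data_file(
        "https://example.com/sales.csv",  # hypothetical data file URL
        ["partner"],                      # primary_key
        ds_name="sales_ds",
        table_type="metric",
    )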

from_datatables(name, datatables, config=None, nlp=False)

Create a DataSource from a list of datatables

Parameters:

  • name - (str) The name to give the datasource
  • datatables - (list of AdHocDataTables) A list of AdHocDataTables to use to create the DataSource
  • config - (dict, optional) A DataSourceConfigSchema dict config
  • nlp - (bool, optional) If true, allow NLP analysis when creating fields

Returns:

(DataSource) - A DataSource created from the datatables and config
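
A usage sketch combining this with a CSVDataTable; the file path and primary key are hypothetical.

    from zillion.datasource import CSVDataTable, DataSource

    dt = CSVDataTable(
        "sales", "/path/to/sales.csv", "metric", primary_key=["id"]
    )
    ds = DataSource.from_datatables("my_ds", [dt])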

from_db_file(file, name=None, config=None, if_exists='fail', replace_after='1 days', nlp=False)

Create a DataSource from a sqlite db file path or url

Parameters:

  • file - (str) A file path/url pointing to a SQLite database
  • name - (str, optional) The name to give the datasource. If blank, it will be derived from the file name.
  • config - (dict, optional) A DataSourceConfigSchema dict config. Note that the connect param of this config will be overwritten if present!
  • if_exists - (str, optional) Control behavior when the database already exists
  • replace_after - (str, optional) Replace the data file after this interval if if_exists is "replace_after". See url_connect docs for more information.
  • nlp - (bool, optional) If true, allow NLP analysis when creating fields

Returns:

(DataSource) - A DataSource created from the db and config
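
A usage sketch with a hypothetical SQLite file URL, refreshed daily via if_exists="replace_after":

    from zillion.datasource import DataSource

    ds = DataSource.from_db_file(
        "https://example.com/example.db",  # hypothetical db file URL
        name="example_ds",
        if_exists="replace_after",
        replace_after="1 days",
    )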

get_child_field_managers(self)

Get a list of child FieldManagers

get_columns_with_field(self, field_name)

Get a list of column objects that support a field

get_dialect_name(self)

Get the name of the SQLAlchemy metadata dialect

get_dim_tables_with_dim(self, dim_name)

Get a list of dimension table objects with the given dimension

get_dimension(self, obj, adhoc_fms=None)

Get a reference to a dimension on this FieldManager

get_dimension_configs(self, adhoc_fms=None)

Get a dict of all dimension configs supported by this FieldManager

get_dimension_names(self, adhoc_fms=None)

Get a set of dimension names supported by this FieldManager

get_dimensions(self, adhoc_fms=None)

Get a dict of all dimensions supported by this FieldManager

get_direct_dimension_configs(self)

Get a dict of dimension configs directly supported by this FieldManager

get_direct_dimensions(self)

Get dimensions directly stored on this FieldManager

get_direct_fields(self)

Get a dict of all fields directly supported by this FieldManager

get_direct_metric_configs(self)

Get a dict of metric configs directly supported by this FieldManager

get_direct_metrics(self)

Get metrics directly stored on this FieldManager

get_field(self, obj, adhoc_fms=None)

Get a reference to a field on this FieldManager

get_field_instances(self, field, adhoc_fms=None)

Get a dict of FieldManagers (including child and adhoc FMs) that support a field

get_field_managers(self, adhoc_fms=None)

Get a list of all child FieldManagers including adhoc

get_field_names(self, adhoc_fms=None)

Get a set of field names supported by this FieldManager

get_fields(self, adhoc_fms=None)

Get a dict of all fields supported by this FieldManager

get_metric(self, obj, adhoc_fms=None)

Get a reference to a metric on this FieldManager. If the object passed is a dict it is expected to define an AdHocMetric.

get_metric_configs(self, adhoc_fms=None)

Get a dict of all metric configs supported by this FieldManager

get_metric_names(self, adhoc_fms=None)

Get a set of metric names supported by this FieldManager

get_metric_tables_with_metric(self, metric_name)

Get a list of metric table objects with the given metric

get_metrics(self, adhoc_fms=None)

Get a dict of all metrics supported by this FieldManager

get_params(self)

Get a simple dict representation of the datasource params. This is currently not sufficient to completely rebuild the datasource.

get_possible_joins(self, table, grain)

Given a table (usually a metric table), try to find one or more joins to each dimension of the grain. Some of these joins may satisfy other parts of the grain as well, which leaves room for consolidation, but the result may also be independent, non-overlapping joins that together meet the grain.

Parameters:

  • table - (SQLAlchemy Table) Table to analyze for joins to grain
  • grain - (iterable) An iterable of dimension names that the given table must join to

Returns:

(dict) - A mapping of dimension -> dimension joins

get_table(self, fullname, check_active=True)

Get the table object from the datasource's metadata

Parameters:

  • fullname - (str) The full name of the table
  • check_active - (bool, optional) If true, check that the table is active

Returns:

(Table) - The SQLAlchemy table object from the metadata

get_tables_with_field(self, field_name, table_type=None)

Get a list of Tables that have a field

Parameters:

  • field_name - (str) The name of the field to check for
  • table_type - (str, optional) Check only this TableType

Returns:

(list) - A list of Table objects

has_dimension(self, name, adhoc_fms=None)

Check whether a dimension is contained in this FieldManager

has_field(self, name, adhoc_fms=None)

Check whether a field is contained in this FieldManager

has_metric(self, name, adhoc_fms=None)

Check whether a metric is contained in this FieldManager

has_table(self, table, check_active=True)

Check whether the table is in this datasource's metadata

Parameters:

  • table - (SQLAlchemy Table) A SQLAlchemy table
  • check_active - (bool, optional) If true, check that the table is active

Returns:

(bool) - True if the table's fullname is in the metadata.tables map

print_dimensions(self, indent=None)

Print all dimensions in this FieldManager

print_info(self)

Print the structure of the datasource

print_metrics(self, indent=None)

Print all metrics in this FieldManager

ExcelDataTable

Bases: zillion.datasource.AdHocDataTable

class zillion.datasource.ExcelDataTable(name, data, table_type, columns=None, primary_key=None, parent=None, if_exists='fail', drop_dupes=False, convert_types=None, fillna_value='', use_full_column_names=False, schema=None, **kwargs)

AdHocDataTable from an Excel file using pandas.read_excel

get_dataframe(self)

table_exists(self, engine)

Determine if this table exists

to_sql(self, engine, method='multi', chunksize=1000)

Use pandas to push the adhoc table data to a SQL database.

Parameters:

  • engine - (SQLAlchemy connection engine) The engine used to connect to the database
  • method - (str, optional) Passed through to pandas
  • chunksize - (int, optional) Passed through to pandas

GoogleSheetsDataTable

Bases: zillion.datasource.AdHocDataTable

class zillion.datasource.GoogleSheetsDataTable(name, data, table_type, columns=None, primary_key=None, parent=None, if_exists='fail', drop_dupes=False, convert_types=None, fillna_value='', use_full_column_names=False, schema=None, **kwargs)

AdHocDataTable from a Google Sheet. Parsed as a CSVDataTable.

get_dataframe(self)

table_exists(self, engine)

Determine if this table exists

to_sql(self, engine, method='multi', chunksize=1000)

Use pandas to push the adhoc table data to a SQL database.

Parameters:

  • engine - (SQLAlchemy connection engine) The engine used to connect to the database
  • method - (str, optional) Passed through to pandas
  • chunksize - (int, optional) Passed through to pandas

HTMLDataTable

Bases: zillion.datasource.AdHocDataTable

class zillion.datasource.HTMLDataTable(name, data, table_type, columns=None, primary_key=None, parent=None, if_exists='fail', drop_dupes=False, convert_types=None, fillna_value='', use_full_column_names=False, schema=None, **kwargs)

AdHocDataTable from an html table using pandas.read_html. By default it expects a table in the same format as produced by: df.reset_index().to_html("table.html", index=False)

get_dataframe(self)

table_exists(self, engine)

Determine if this table exists

to_sql(self, engine, method='multi', chunksize=1000)

Use pandas to push the adhoc table data to a SQL database.

Parameters:

  • engine - (SQLAlchemy connection engine) The engine used to connect to the database
  • method - (str, optional) Passed through to pandas
  • chunksize - (int, optional) Passed through to pandas

JSONDataTable

Bases: zillion.datasource.AdHocDataTable

class zillion.datasource.JSONDataTable(name, data, table_type, columns=None, primary_key=None, parent=None, if_exists='fail', drop_dupes=False, convert_types=None, fillna_value='', use_full_column_names=False, schema=None, **kwargs)

AdHocDataTable from a JSON file using pandas.read_json

get_dataframe(self, orient='table')

table_exists(self, engine)

Determine if this table exists

to_sql(self, engine, method='multi', chunksize=1000)

Use pandas to push the adhoc table data to a SQL database.

Parameters:

  • engine - (SQLAlchemy connection engine) The engine used to connect to the database
  • method - (str, optional) Passed through to pandas
  • chunksize - (int, optional) Passed through to pandas

Join

Bases: tlbx.logging_utils.PrintMixin

class zillion.datasource.Join(join_parts, field_map)

Represents a join (potentially multi-part) that will be part of a query

Parameters:

  • join_parts - (list of JoinParts) A list of JoinParts that will make up a single Join
  • field_map - (dict) The requested fields this join is meant to satisfy
add_field(self, field)

Denote that this field is covered in this join

add_fields(self, fields)

Add multiple fields as covered in this join

add_join_part_tables(self, join_part)

Add tables from join parts to the table list

combine(join1, join2)

Create a new Join object that combines the parts and fields of the given joins

get_covered_fields(self)

Generate a list of all possible fields this can cover

join_fields_for_table(self, table_name)

Get a list of join fields for a particular table in the join

join_parts_for_table(self, table_name)

Get a list of JoinParts that reference a particular table

JoinPart

Bases: tlbx.logging_utils.PrintMixin

class zillion.datasource.JoinPart(datasource, table_names, join_fields)

A part of a join that defines a join between two particular tables

NeighborTable

Bases: tlbx.logging_utils.PrintMixin

class zillion.datasource.NeighborTable(table, join_fields)

Represents a neighboring node in the datasource graph

SQLiteDataTable

Bases: zillion.datasource.AdHocDataTable

class zillion.datasource.SQLiteDataTable(name, data, table_type, columns=None, primary_key=None, parent=None, if_exists='fail', drop_dupes=False, convert_types=None, fillna_value='', use_full_column_names=False, schema=None, **kwargs)

AdHocDataTable from an existing sqlite database on the local filesystem

Note: the "data" param to AdHocDataTable is ignored. This is simply a workaround to get an AdHocDataTable reference for an existing SQLite DB without having to recreate anything from data.

get_dataframe(self)

table_exists(self, engine)

Determine if this table exists

to_sql(self, engine, **kwargs)

TableSet

Bases: tlbx.logging_utils.PrintMixin

class zillion.datasource.TableSet(datasource, ds_table, join, grain, target_fields)

A set of tables in a datasource that can meet a grain and provide target fields.

Parameters:

  • datasource - (DataSource) The DataSource containing all tables
  • ds_table - (Table) A table containing a desired metric or dimension
  • join - (Join) A join to related tables that satisfies the grain and provides the target fields
  • grain - (list of str) A list of dimensions that must be supported by the join
  • target_fields - (list of str) A list of fields being targeted
get_covered_fields(self)

Get a list of all covered fields in this table set

get_covered_metrics(self, wh)

Get a list of metrics covered by this table set

Parameters:

  • wh - (Warehouse) The warehouse to use as a reference for metric fields

Returns:

(list of str) - A list of metric names covered in this TableSet

connect_url_to_metadata

zillion.datasource.connect_url_to_metadata(url, ds_name=None)

Create a bound SQLAlchemy MetaData object from a database URL. The ds_name param is used to determine datasource config context for variable substitution.
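
A brief sketch, assuming a hypothetical local SQLite URL:

    from zillion.datasource import connect_url_to_metadata

    metadata = connect_url_to_metadata(
        "sqlite:///warehouse.db", ds_name="my_ds"
    )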

data_url_to_metadata

zillion.datasource.data_url_to_metadata(data_url, ds_name, if_exists='fail', replace_after='1 days')

Create a bound SQLAlchemy MetaData object from a data URL. The ds_name param is used to determine datasource config context for variable substitution.

datatable_from_config

zillion.datasource.datatable_from_config(name, config, schema=None, **kwargs)

Factory to create an AdHocDataTable from a given config. The type of the AdHocDataTable created will be inferred from the config["url"] param.

Parameters:

  • name - (str) The name of the table
  • config - (dict) The configuration of the table
  • schema - (str, optional) The schema in which the table resides
  • kwargs - Passed to init of the particular AdHocDataTable class

Returns:

(AdHocDataTable) - Return the created AdHocDataTable (subclass)
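
A sketch of a possible config; only the "url" key is documented above, and the remaining keys are illustrative assumptions.

    from zillion.datasource import datatable_from_config

    config = {
        # The table type (CSV here) is inferred from this URL
        "url": "https://example.com/sales.csv",
        # Assumed keys for illustration:
        "type": "metric",
        "primary_key": ["id"],
    }

    dt = datatable_from_config("sales", config)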

entity_name_from_file

zillion.datasource.entity_name_from_file(filename)

get_adhoc_datasource_filename

zillion.datasource.get_adhoc_datasource_filename(ds_name)

Get the filename where the adhoc datasource will be located

get_adhoc_datasource_url

zillion.datasource.get_adhoc_datasource_url(ds_name)

Get a connection URL for the datasource

get_ds_config_context

zillion.datasource.get_ds_config_context(name)

Helper to get datasource context from the zillion config

join_from_path

zillion.datasource.join_from_path(ds, path, field_map=None)

Given a path in the datasource graph, get the corresponding Join

Parameters:

  • ds - (DataSource) The datasource for the join
  • path - (list of str) A list of tables that form a join path
  • field_map - (dict, optional) Passed through to Join init

Returns:

(Join) - A Join between all tables in the path

metadata_from_connect

zillion.datasource.metadata_from_connect(connect, ds_name)

Create a bound SQLAlchemy MetaData object from a "connect" param. The connect value may be a connection string or a DataSourceConnectSchema dict. See the DataSourceConnectSchema docs for more details on that format.

parse_replace_after

zillion.datasource.parse_replace_after(replace_after)

Parse a case-insensitive interval string of the format "number interval". The number may be a float, and the interval options are: seconds, minutes, hours, days, weeks.
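
For example, the following strings follow the documented format; the exact return value is not documented here.

    from zillion.datasource import parse_replace_after

    parse_replace_after("30 seconds")
    parse_replace_after("1.5 days")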

populate_url_context

zillion.datasource.populate_url_context(url, ds_name)

Helper to do variable replacement in URLs

reflect_metadata

zillion.datasource.reflect_metadata(metadata, reflect_only=None)

Reflect the metadata object from the connection. If reflect_only is passed, reflect only the tables specified in that list

url_connect

zillion.datasource.url_connect(ds_name, connect_url=None, data_url=None, if_exists='fail', replace_after='1 days')

A URL-based datasource connector. This is meant to be used as the "func" value of a DataSourceConnectSchema. Only one of connect_url or data_url may be specified.

Parameters:

  • ds_name - (str) The name of the datasource to get a connection for
  • connect_url - (str, optional) If a connect_url is passed, it will create a bound MetaData object from that connection string.
  • data_url - (str, optional) If a data_url is passed, it will first download that data (or make sure it is already downloaded) and then create a connection to that data file, which is assumed to be a SQLite database. The name of the database file will be based on the name of the datasource passed in.
  • if_exists - (str, optional) If a data_url is in use, this will control handling of existing data under the same filename. If "fail", an exception will be raised if the file already exists. If "ignore", it will skip downloading the file if it exists. If "replace", it will create or replace the file. If "replace_after" it will check the age of the file and replace it after the age interval provided in the replace_after param.
  • replace_after - (str, optional) If if_exists is "replace_after", use this value to determine the age threshold. The format is "number interval", where the number may be an int or float and the interval options are: seconds, minutes, hours, days, weeks. Note that this only occurs when url_connect is called, which is typically on datasource init; it does not replace itself periodically while the datasource is instantiated.
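
A sketch of how url_connect might appear as the "func" of a connect config. The "func"/"params" dict shape is an assumption about DataSourceConnectSchema, not a confirmed spec.

    # Hypothetical DataSourceConnectSchema-style dict
    connect = {
        "func": "zillion.datasource.url_connect",
        "params": {
            "data_url": "https://example.com/example.db",
            "if_exists": "replace_after",
            "replace_after": "2 hours",
        },
    }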