Module zillion.datasource¶
AdHocDataTable¶
Bases: tlbx.logging_utils.PrintMixin
zillion.datasource.AdHocDataTable
(name, data, table_type, columns=None, primary_key=None, parent=None, if_exists='fail', drop_dupes=False, convert_types=None, fillna_value='', use_full_column_names=False, schema=None, **kwargs)Create an adhoc table from raw data that can be added to a DataSource
Parameters:
- name - (str) The name of the table
- data - (iterable, DataFrame, or str) The data to create the adhoc table from
- table_type - (str) Specify the TableType
- columns - (dict, optional) Column configuration for the table
- primary_key - (list of str, optional) A list of fields that make up the primary key of the table
- parent - (str, optional) A reference to a parent table in the same datasource
- if_exists - (str, optional) Control behavior when datatables already exist in the database
- drop_dupes - (bool, optional) Drop duplicate primary key rows when loading the table
- convert_types - (dict, optional) A mapping of column names to types to convert to when loading the table. The types must be strings representing valid sqlalchemy types. Ex: {"col1": "date", "col2": "integer"}
- fillna_value - (str or int, optional) Fill null values in primary key columns with this value before writing to a SQL database.
- schema - (str, optional) The schema in which the table resides
- kwargs - Keyword arguments passed to pandas.DataFrame.from_records if a DataFrame is created from iterable data
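A minimal construction sketch, assuming illustrative table/field names and data (the "dimension" table type value comes from zillion's TableTypes):

```python
from zillion.datasource import AdHocDataTable

# Build an adhoc dimension table from an iterable of records.
# Table name, field names, and rows here are purely illustrative.
partners = AdHocDataTable(
    "partners",                # name
    [
        {"partner_id": 1, "partner_name": "Partner A"},
        {"partner_id": 2, "partner_name": "Partner B"},
    ],
    "dimension",               # table_type
    primary_key=["partner_id"],
    if_exists="replace",
)
df = partners.get_dataframe()  # materialize the data as a DataFrame
```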
get_dataframe
(self)Get the DataFrame representation of the data
table_exists
(self, engine)Determine if this table exists
to_sql
(self, engine, method='multi', chunksize=1000)Use pandas to push the adhoc table data to a SQL database.
Parameters:
- engine - (SQLAlchemy connection engine) The engine used to connect to the database
- method - (str, optional) Passed through to pandas
- chunksize - (int, optional) Passed through to pandas
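A usage sketch continuing the hypothetical table above, pushing it into an in-memory SQLite engine (the engine URL is illustrative):

```python
from sqlalchemy import create_engine

engine = create_engine("sqlite://")  # in-memory SQLite, illustrative
partners.to_sql(engine, method="multi", chunksize=500)
assert partners.table_exists(engine)
```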
CSVDataTable¶
Bases: zillion.datasource.AdHocDataTable
zillion.datasource.CSVDataTable
(name, data, table_type, columns=None, primary_key=None, parent=None, if_exists='fail', drop_dupes=False, convert_types=None, fillna_value='', use_full_column_names=False, schema=None, **kwargs)AdHocDataTable from a CSV file using pandas.read_csv
get_dataframe
(self)
table_exists
(self, engine)Determine if this table exists
to_sql
(self, engine, method='multi', chunksize=1000)Use pandas to push the adhoc table data to a SQL database.
Parameters:
- engine - (SQLAlchemy connection engine) The engine used to connect to the database
- method - (str, optional) Passed through to pandas
- chunksize - (int, optional) Passed through to pandas
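A short sketch; the file path "sales.csv" and its columns are hypothetical:

```python
from zillion.datasource import CSVDataTable

# data is a path to the CSV file; pandas.read_csv handles the parsing
sales = CSVDataTable(
    "sales",
    "sales.csv",
    "metric",                  # table_type
    primary_key=["sale_id"],
)
df = sales.get_dataframe()
```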
DataSource¶
Bases: zillion.field.FieldManagerMixin, tlbx.logging_utils.PrintMixin
zillion.datasource.DataSource
(name, metadata=None, config=None, nlp=False)A component of a warehouse that houses one or more related tables
Parameters:
- name - (str) The name of the datasource
- metadata - (SQLAlchemy metadata, optional) A SQLAlchemy metadata object that may have zillion configuration information defined in the table and column info.zillion attribute
- config - (dict, str, or buffer, optional) A dict adhering to the DataSourceConfigSchema or a file location to load the config from
- nlp - (bool, optional) If true, allow NLP analysis when creating fields
add_dimension
(self, dimension, force=False)Add a reference to a dimension to this FieldManager
add_metric
(self, metric, force=False)Add a reference to a metric to this FieldManager
apply_config
(self, config, reflect=False, nlp=False)Apply a datasource config to this datasource's metadata. This will also ensure zillion info is present on the metadata, populate global fields, and rebuild the datasource graph.
Parameters:
- config - (dict) The datasource config to apply
- reflect - (bool, optional) If true, use SQLAlchemy to reflect the database. Table-level reflection will also occur if any tables are created from data URLs
- nlp - (bool, optional) If true, allow NLP analysis when creating fields
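A sketch of applying a config to an existing DataSource ds; the table and field names are assumptions, and the dict shape follows the DataSourceConfigSchema conventions used elsewhere in zillion:

```python
# Hypothetical table/field names; shape per DataSourceConfigSchema
config = {
    "tables": {
        "main.sales": {
            "type": "metric",
            "primary_key": ["sale_id"],
            "columns": {
                "sale_id": {"fields": ["sale_id"]},
                "revenue": {"fields": ["revenue"]},
            },
        }
    }
}
ds.apply_config(config)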
directly_has_dimension
(self, name)Check if this FieldManager directly stores this dimension
directly_has_field
(self, name)Check if this FieldManager directly stores this field
directly_has_metric
(self, name)Check if this FieldManager directly stores this metric
find_descendent_tables
(self, table)Find graph descendants of the table
find_neighbor_tables
(self, table)Find tables that can be joined to or are parents of the given table
Parameters:
- table - (SQLAlchemy Table) The table to find neighbors for
Returns:
(list) - A list of NeighborTables
find_possible_table_sets
(self, ds_tables_with_field, field, grain, dimension_grain)Find table sets that meet the grain
Parameters:
- ds_tables_with_field - (list of tables) A list of datasource tables that have the target field
- field - (str) The target field we are trying to cover
- grain - (iterable) The grain the table set must support
- dimension_grain - The subset of the grain that consists of requested dimensions
Returns:
(list) - A list of TableSets
from_data_file
(file, primary_key, ds_name=None, schema='main', table_name=None, table_type='metric', nlp=False, **kwargs)Create a DataSource from a data file path/url that represents a single table
Parameters:
- file - (str) A file path/url pointing to a data file. If remote the file will be downloaded to the adhoc data directory.
- primary_key - (list of str) A list of fields representing the primary key of the table
- ds_name - (str, optional) The name to give the datasource
- schema - (str, optional) The schema to create the table in
- table_name - (str, optional) The name to give the table. If blank it will be derived from the file name.
- table_type - (str, optional) Specifies the TableType
- nlp - (bool, optional) If true, allow NLP analysis when creating fields
- kwargs - (dict, optional) Additional keyword arguments to pass to the AdHocDataTable
Returns:
(DataSource) - A DataSource created from the file data and config
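A hedged sketch; the URL, table columns, and datasource name are illustrative:

```python
from zillion.datasource import DataSource

# Hypothetical remote CSV; downloaded to the adhoc data directory
ds = DataSource.from_data_file(
    "https://example.com/sales.csv",
    ["sale_id"],              # primary_key
    ds_name="sales_data",
    table_type="metric",
)
```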
from_datatables
(name, datatables, config=None, nlp=False)Create a DataSource from a list of datatables
Parameters:
- name - (str) The name to give the datasource
- datatables - (list of AdHocDataTables) A list of AdHocDataTables to use to create the DataSource
- config - (dict, optional) A DataSourceConfigSchema dict config
- nlp - (bool, optional) If true, allow NLP analysis when creating fields
Returns:
(DataSource) - A DataSource created from the datatables and config
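Continuing the hypothetical tables sketched earlier:

```python
# partners and sales are the AdHocDataTables constructed above
ds = DataSource.from_datatables("adhoc_ds", [partners, sales])
```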
from_db_file
(file, name=None, config=None, if_exists='fail', replace_after='1 days', nlp=False)Create a DataSource from a SQLite database file path or URL
Parameters:
- file - (str) A file path/url pointing to a SQLite database
- name - (str, optional) The name to give the datasource. If blank it will be derived from the file name.
- config - (dict, optional) A DataSourceConfigSchema dict config. Note that the connect param of this config will be overwritten if present!
- if_exists - (str, optional) Control behavior when the database already exists
- replace_after - (str, optional) Replace the data file after this interval if if_exists is "replace_after". See the url_connect docs for more information.
- nlp - (bool, optional) If true, allow NLP analysis when creating fields
Returns:
(DataSource) - A DataSource created from the db and config
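A minimal sketch; the file path is hypothetical, and the if_exists/replace_after values follow the parameter docs above:

```python
# Hypothetical SQLite file path
ds = DataSource.from_db_file(
    "data/example.db",
    name="example",
    if_exists="replace_after",
    replace_after="1 days",
)
```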
get_child_field_managers
(self)Get a list of child FieldManagers
get_columns_with_field
(self, field_name)Get a list of column objects that support a field
get_dialect_name
(self)Get the name of the SQLAlchemy metadata dialect
get_dim_tables_with_dim
(self, dim_name)Get a list of dimension table objects with the given dimension
get_dimension
(self, obj, adhoc_fms=None)Get a reference to a dimension on this FieldManager
get_dimension_configs
(self, adhoc_fms=None)Get a dict of all dimension configs supported by this FieldManager
get_dimension_names
(self, adhoc_fms=None)Get a set of dimension names supported by this FieldManager
get_dimensions
(self, adhoc_fms=None)Get a dict of all dimensions supported by this FieldManager
get_direct_dimension_configs
(self)Get a dict of dimension configs directly supported by this FieldManager
get_direct_dimensions
(self)Get dimensions directly stored on this FieldManager
get_direct_fields
(self)Get a dict of all fields directly supported by this FieldManager
get_direct_metric_configs
(self)Get a dict of metric configs directly supported by this FieldManager
get_direct_metrics
(self)Get metrics directly stored on this FieldManager
get_field
(self, obj, adhoc_fms=None)Get a reference to a field on this FieldManager
get_field_instances
(self, field, adhoc_fms=None)Get a dict of FieldManagers (including child and adhoc FMs) that support a field
get_field_managers
(self, adhoc_fms=None)Get a list of all child FieldManagers including adhoc
get_field_names
(self, adhoc_fms=None)Get a set of field names supported by this FieldManager
get_fields
(self, adhoc_fms=None)Get a dict of all fields supported by this FieldManager
get_metric
(self, obj, adhoc_fms=None)Get a reference to a metric on this FieldManager. If the object passed is a dict it is expected to define an AdHocMetric.
get_metric_configs
(self, adhoc_fms=None)Get a dict of all metric configs supported by this FieldManager
get_metric_names
(self, adhoc_fms=None)Get a set of metric names supported by this FieldManager
get_metric_tables_with_metric
(self, metric_name)Get a list of metric table objects with the given metric
get_metrics
(self, adhoc_fms=None)Get a dict of all metrics supported by this FieldManager
get_params
(self)Get a simple dict representation of the datasource params. This is currently not sufficient to completely rebuild the datasource.
get_possible_joins
(self, table, grain)This takes a given table (usually a metric table) and tries to find one or more joins to each dimension of the grain. Some of these joins may satisfy other parts of the grain too, which leaves room for consolidation, but it's also possible for this to generate independent, non-overlapping joins to meet the grain.
Parameters:
- table - (SQLAlchemy Table) Table to analyze for joins to grain
- grain - (iterable) An iterable of dimension names that the given table must join to
Returns:
(dict) - A mapping of dimension -> dimension joins
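A hedged call sketch; the table fullname and dimension name are hypothetical:

```python
table = ds.get_table("main.sales")                       # hypothetical fullname
joins = ds.get_possible_joins(table, {"partner_name"})   # grain: iterable of dimension names
```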
get_table
(self, fullname, check_active=True)Get the table object from the datasource's metadata
Parameters:
- fullname - (str) The full name of the table
- check_active - (bool, optional) If true, check that the table is active
Returns:
(Table) - The SQLAlchemy table object from the metadata
get_tables_with_field
(self, field_name, table_type=None)Get a list of Tables that have a field
Parameters:
- field_name - (str) The name of the field to check for
- table_type - (str, optional) Check only this TableType
Returns:
(list) - A list of Table objects
has_dimension
(self, name, adhoc_fms=None)Check whether a dimension is contained in this FieldManager
has_field
(self, name, adhoc_fms=None)Check whether a field is contained in this FieldManager
has_metric
(self, name, adhoc_fms=None)Check whether a metric is contained in this FieldManager
has_table
(self, table, check_active=True)Check whether the table is in this datasource's metadata
Parameters:
- table - (SQLAlchemy Table) A SQLAlchemy table
- check_active - (bool, optional) If true, check that the table is active
Returns:
(bool) - True if the table's fullname is in the metadata.tables map
print_dimensions
(self, indent=None)Print all dimensions in this FieldManager
print_info
(self)Print the structure of the datasource
print_metrics
(self, indent=None)Print all metrics in this FieldManager
ExcelDataTable¶
Bases: zillion.datasource.AdHocDataTable
zillion.datasource.ExcelDataTable
(name, data, table_type, columns=None, primary_key=None, parent=None, if_exists='fail', drop_dupes=False, convert_types=None, fillna_value='', use_full_column_names=False, schema=None, **kwargs)AdHocDataTable from an Excel file using pandas.read_excel
get_dataframe
(self)
table_exists
(self, engine)Determine if this table exists
to_sql
(self, engine, method='multi', chunksize=1000)Use pandas to push the adhoc table data to a SQL database.
Parameters:
- engine - (SQLAlchemy connection engine) The engine used to connect to the database
- method - (str, optional) Passed through to pandas
- chunksize - (int, optional) Passed through to pandas
GoogleSheetsDataTable¶
Bases: zillion.datasource.AdHocDataTable
zillion.datasource.GoogleSheetsDataTable
(name, data, table_type, columns=None, primary_key=None, parent=None, if_exists='fail', drop_dupes=False, convert_types=None, fillna_value='', use_full_column_names=False, schema=None, **kwargs)AdHocDataTable from a Google Sheet. Parsed as a CSVDataTable.
get_dataframe
(self)
table_exists
(self, engine)Determine if this table exists
to_sql
(self, engine, method='multi', chunksize=1000)Use pandas to push the adhoc table data to a SQL database.
Parameters:
- engine - (SQLAlchemy connection engine) The engine used to connect to the database
- method - (str, optional) Passed through to pandas
- chunksize - (int, optional) Passed through to pandas
HTMLDataTable¶
Bases: zillion.datasource.AdHocDataTable
zillion.datasource.HTMLDataTable
(name, data, table_type, columns=None, primary_key=None, parent=None, if_exists='fail', drop_dupes=False, convert_types=None, fillna_value='', use_full_column_names=False, schema=None, **kwargs)AdHocDataTable from an HTML table using pandas.read_html. By default it expects a table in the same format as produced by: df.reset_index().to_html("table.html", index=False)
get_dataframe
(self)
table_exists
(self, engine)Determine if this table exists
to_sql
(self, engine, method='multi', chunksize=1000)Use pandas to push the adhoc table data to a SQL database.
Parameters:
- engine - (SQLAlchemy connection engine) The engine used to connect to the database
- method - (str, optional) Passed through to pandas
- chunksize - (int, optional) Passed through to pandas
JSONDataTable¶
Bases: zillion.datasource.AdHocDataTable
zillion.datasource.JSONDataTable
(name, data, table_type, columns=None, primary_key=None, parent=None, if_exists='fail', drop_dupes=False, convert_types=None, fillna_value='', use_full_column_names=False, schema=None, **kwargs)AdHocDataTable from a JSON file using pandas.read_json
get_dataframe
(self, orient='table')
table_exists
(self, engine)Determine if this table exists
to_sql
(self, engine, method='multi', chunksize=1000)Use pandas to push the adhoc table data to a SQL database.
Parameters:
- engine - (SQLAlchemy connection engine) The engine used to connect to the database
- method - (str, optional) Passed through to pandas
- chunksize - (int, optional) Passed through to pandas
Join¶
Bases: tlbx.logging_utils.PrintMixin
zillion.datasource.Join
(join_parts, field_map)Represents a join (potentially multi-part) that will be part of a query
Parameters:
- join_parts - (list of JoinParts) A list of JoinParts that will make up a single Join
- field_map - (dict) The requested fields this join is meant to satisfy
add_field
(self, field)Denote that this field is covered in this join
add_fields
(self, fields)Add multiple fields as covered in this join
add_join_part_tables
(self, join_part)Add tables from join parts to the table list
combine
(join1, join2)Create a new Join object that combines the parts and fields of the given joins
get_covered_fields
(self)Generate a list of all possible fields this can cover
join_fields_for_table
(self, table_name)Get a list of join fields for a particular table in the join
join_parts_for_table
(self, table_name)Get a list of JoinParts that reference a particular table
JoinPart¶
Bases: tlbx.logging_utils.PrintMixin
zillion.datasource.JoinPart
(datasource, table_names, join_fields)A part of a join that defines a join between two particular tables
NeighborTable¶
Bases: tlbx.logging_utils.PrintMixin
zillion.datasource.NeighborTable
(table, join_fields)Represents a neighboring node in the datasource graph
SQLiteDataTable¶
Bases: zillion.datasource.AdHocDataTable
zillion.datasource.SQLiteDataTable
(name, data, table_type, columns=None, primary_key=None, parent=None, if_exists='fail', drop_dupes=False, convert_types=None, fillna_value='', use_full_column_names=False, schema=None, **kwargs)AdHocDataTable from an existing sqlite database on the local filesystem
Note: the "data" param to AdHocDataTable is ignored. This is simply a workaround to get an AdHocDataTable reference for an existing SQLite DB without having to recreate anything from data.
get_dataframe
(self)
table_exists
(self, engine)Determine if this table exists
to_sql
(self, engine, **kwargs)
TableSet¶
Bases: tlbx.logging_utils.PrintMixin
zillion.datasource.TableSet
(datasource, ds_table, join, grain, target_fields)A set of tables in a datasource that can meet a grain and provide target fields.
Parameters:
- datasource - (DataSource) The DataSource containing all tables
- ds_table - (Table) A table containing a desired metric or dimension
- join - (Join) A join to related tables that satisfies the grain and provides the target fields
- grain - (list of str) A list of dimensions that must be supported by the join
- target_fields - (list of str) A list of fields being targeted
get_covered_fields
(self)Get a list of all covered fields in this table set
get_covered_metrics
(self, wh)Get a list of metrics covered by this table set
Parameters:
- wh - (Warehouse) The warehouse to use as a reference for metric fields
Returns:
(list of str) - A list of metric names covered in this TableSet
connect_url_to_metadata¶
zillion.datasource.connect_url_to_metadata
(url, ds_name=None)Create a bound SQLAlchemy MetaData object from a database URL. The ds_name param is used to determine datasource config context for variable substitution.
data_url_to_metadata¶
zillion.datasource.data_url_to_metadata
(data_url, ds_name, if_exists='fail', replace_after='1 days')Create a bound SQLAlchemy MetaData object from a data URL. The ds_name param is used to determine datasource config context for variable substitution.
datatable_from_config¶
zillion.datasource.datatable_from_config
(name, config, schema=None, **kwargs)Factory to create an AdHocDataTable from a given config. The type of the AdHocDataTable created will be inferred from the config["url"] param.
Parameters:
- name - (str) The name of the table
- config - (dict) The configuration of the table
- schema - (str, optional) The schema in which the table resides
- kwargs - Passed to init of the particular AdHocDataTable class
Returns:
(AdHocDataTable) - Return the created AdHocDataTable (subclass)
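A hedged sketch; the file name is hypothetical, and the config keys besides "url" are assumptions based on the table config conventions documented above:

```python
from zillion.datasource import datatable_from_config

# The ".csv" extension in config["url"] should select a CSVDataTable
dt = datatable_from_config(
    "sales",
    {
        "url": "sales.csv",
        "type": "metric",
        "primary_key": ["sale_id"],
    },
)
```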
entity_name_from_file¶
zillion.datasource.entity_name_from_file
(filename)
get_adhoc_datasource_filename¶
zillion.datasource.get_adhoc_datasource_filename
(ds_name)Get the filename where the adhoc datasource will be located
get_adhoc_datasource_url¶
zillion.datasource.get_adhoc_datasource_url
(ds_name)Get a connection URL for the datasource
get_ds_config_context¶
zillion.datasource.get_ds_config_context
(name)Helper to get datasource context from the zillion config
join_from_path¶
zillion.datasource.join_from_path
(ds, path, field_map=None)Given a path in the datasource graph, get the corresponding Join
Parameters:
- ds - (DataSource) The datasource for the join
- path - (list of str) A list of tables that form a join path
- field_map - (dict, optional) Passed through to Join init
Returns:
(Join) - A Join between all tables in the path
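A minimal sketch, assuming a hypothetical two-table join path in the datasource graph:

```python
from zillion.datasource import join_from_path

# ds is a DataSource; the table names are illustrative
join = join_from_path(ds, ["main.sales", "main.partners"])
```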
metadata_from_connect¶
zillion.datasource.metadata_from_connect
(connect, ds_name)Create a bound SQLAlchemy MetaData object from a "connect" param. The connect value may be a connection string or a DataSourceConnectSchema dict. See the DataSourceConnectSchema docs for more details on that format.
parse_replace_after¶
zillion.datasource.parse_replace_after
(replace_after)Parse a case-insensitive interval string of the format "number interval". The number may be a float, and the interval options are: seconds, minutes, hours, days, weeks.
populate_url_context¶
zillion.datasource.populate_url_context
(url, ds_name)Helper to do variable replacement in URLs
reflect_metadata¶
zillion.datasource.reflect_metadata
(metadata, reflect_only=None)Reflect the metadata object from the connection. If reflect_only is passed, reflect only the tables specified in that list
url_connect¶
zillion.datasource.url_connect
(ds_name, connect_url=None, data_url=None, if_exists='fail', replace_after='1 days')A URL-based datasource connector. This is meant to be used as the "func" value of a DataSourceConnectSchema. Only one of connect_url or data_url may be specified.
Parameters:
- ds_name - (str) The name of the datasource to get a connection for
- connect_url - (str, optional) If a connect_url is passed, it will create a bound MetaData object from that connection string.
- data_url - (str, optional) If a data_url is passed, it will first download that data (or make sure it is already downloaded) and then create a connection to that data file, which is assumed to be a SQLite database. The name of the database file will be based on the name of the datasource passed in.
- if_exists - (str, optional) If a data_url is in use, this will control handling of existing data under the same filename. If "fail", an exception will be raised if the file already exists. If "ignore", it will skip downloading the file if it exists. If "replace", it will create or replace the file. If "replace_after", it will check the age of the file and replace it after the age interval provided in the replace_after param.
- replace_after - (str, optional) If if_exists is "replace_after", use this value to determine the age threshold. The format is "number interval", where the number may be an int or float and the interval options are: seconds, minutes, hours, days, weeks. Note that this only occurs when url_connect is called, which is typically on datasource init; it does not replace itself periodically while the datasource is instantiated.
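A sketch of how url_connect might appear as the "func" value of a datasource connect config; the "connect"/"func"/"params" key names are assumptions inferred from the DataSourceConnectSchema description above, and the data URL is illustrative:

```python
# Hypothetical datasource config snippet; key names assumed, not confirmed
config = {
    "connect": {
        "func": "zillion.datasource.url_connect",
        "params": {
            "data_url": "https://example.com/example.db",  # assumed SQLite file
            "if_exists": "replace_after",
            "replace_after": "2 hours",
        },
    }
}
```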