IntentTransition Class

class ds_discovery.intent.transition_intent.TransitionIntentModel(property_manager: ~ds_discovery.managers.transition_property_manager.TransitionPropertyManager, default_save_intent: bool = None, default_intent_level: [<class 'str'>, <class 'int'>, <class 'float'>] = None, order_next_available: bool = None, default_replace_intent: bool = None)

This component provides a set of actions that focuses on tidying raw data by removing data columns that are not useful to the final feature set, also known as data selection. These may include null columns, single value columns, duplicate columns and noise etc. We can also ensure the data is properly canonicalized through enforcing data typing.

auto_clean_header(df, case=None, rename_map: [<class 'dict'>, <class 'str'>] = None, replace_spaces: str = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

Cleans the headers of a pandas DataFrame, replacing spaces with underscores by default. The rename_map can be passed either as a dict or as the name of a connector contract that holds the column mapping.

Parameters:
  • df – the pandas.DataFrame whose headers are to be cleaned

  • rename_map – a dict of name value pairs or connector name for column mapping

  • case – changes the headers to lower, upper, title, snake. if none of these then no change

  • replace_spaces – character to replace spaces with. Default is ‘_’ (underscore)

  • inplace – if the passed pandas.DataFrame should be used or a deep copy

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
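The header clean-up described above can be illustrated with plain pandas. This is a minimal sketch, not the library's implementation: the helper name clean_headers is hypothetical, only a dict rename_map is handled (not a connector name), and the ‘snake’ case is left out.

```python
import pandas as pd


def clean_headers(df, case=None, rename_map=None, replace_spaces="_"):
    """Return a deep copy of df with renamed, space-free, optionally re-cased headers."""
    out = df.copy(deep=True)
    if isinstance(rename_map, dict):
        out = out.rename(columns=rename_map)
    headers = [str(h).replace(" ", replace_spaces) for h in out.columns]
    # Only three simple cases are sketched; any other value leaves the case unchanged.
    if case == "lower":
        headers = [h.lower() for h in headers]
    elif case == "upper":
        headers = [h.upper() for h in headers]
    elif case == "title":
        headers = [h.title() for h in headers]
    out.columns = headers
    return out


raw = pd.DataFrame({"First Name": ["Ada"], "Home Town": ["London"], "yr": [1815]})
cleaned = clean_headers(raw, case="lower", rename_map={"yr": "Birth Year"})
print(list(cleaned.columns))
```

Because the helper works on a deep copy, the headers of the frame that was passed in are left as they were.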

auto_drop_columns(df, null_min: float = None, predominant_max: float = None, nulls_list: [<class 'bool'>, <class 'list'>] = None, drop_predominant: bool = None, drop_empty_row: bool = None, drop_unknown: bool = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]

Automatically removes columns that are at least 99.8% (0.998) np.NaN, hold a single value, have a standard deviation of zero, or have a predominant value that exceeds the default of 99.8% (0.998).

Parameters:
  • df – the pandas.DataFrame to auto remove columns from

  • null_min – the minimum fraction of null values for a column to be removed. Defaults to 0.998 (99.8% nulls)

  • predominant_max – the maximum fraction of a column that a single value may occupy. Defaults to 0.998 (99.8%)

  • nulls_list – can be a boolean or a list: if boolean and True then nulls_list equals [‘NaN’, ‘nan’, ‘null’, ‘’, ‘None’, ‘ ’]; if a list, its items are treated as potential null values.

  • drop_predominant – drop columns that have a predominant value of the given predominant max

  • drop_empty_row – also drop any rows where all the values are empty

  • drop_unknown – (optional) drop objects that are not string types such as binary

  • inplace – if to change the passed pandas.DataFrame or return a copy (see return)

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
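To make the thresholds concrete, the following sketch applies the same three tests (null fraction, single value, predominant value) with plain pandas. The helper name is hypothetical and the exact rules inside the library may differ; nulls_list, drop_empty_row and drop_unknown are not covered.

```python
import numpy as np
import pandas as pd


def drop_uninformative_columns(df, null_min=0.998, predominant_max=0.998):
    """Return a copy of df without mostly-null, constant or predominantly single-valued columns."""
    to_drop = []
    for col in df.columns:
        series = df[col]
        # Fraction of nulls in the column, between 0 and 1.
        if series.isna().mean() >= null_min:
            to_drop.append(col)
            continue
        # A single distinct non-null value carries no information.
        if series.nunique(dropna=True) <= 1:
            to_drop.append(col)
            continue
        # Share of the most frequent non-null value.
        top_share = series.value_counts(normalize=True, dropna=True).iloc[0]
        if top_share > predominant_max:
            to_drop.append(col)
    return df.drop(columns=to_drop)


df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "empty": [np.nan] * 4,
    "constant": ["x"] * 4,
    "mostly_a": ["a", "a", "a", "b"],
})
strict = drop_uninformative_columns(df, predominant_max=0.7)
default = drop_uninformative_columns(df)
print(list(strict.columns), list(default.columns))
```

With the default thresholds only the empty and constant columns go; lowering predominant_max to 0.7 also removes the column in which one value fills 75% of the rows.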

auto_drop_correlated(df: ~pandas.core.frame.DataFrame, threshold: float = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>]

Uses ‘brute force’ techniques to remove highly correlated numeric columns based on the threshold, which defaults to 0.998.

Parameters:
  • df – the Canonical data to drop correlated columns from

  • threshold – (optional) threshold correlation between columns. default 0.998

  • inplace – if the passed Canonical should be used or a deep copy

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

if inplace, returns a formatted cleaner contract for this method, else a deep copy Canonical.
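One common way to drop highly correlated numeric columns is to inspect the upper triangle of the absolute correlation matrix. The sketch below shows that approach with pandas and NumPy; it is an illustration under that assumption, not necessarily the ‘brute force’ routine the library uses, and drop_correlated is a hypothetical name.

```python
import numpy as np
import pandas as pd


def drop_correlated(df, threshold=0.998):
    """Drop each numeric column that correlates above threshold with an earlier column."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so every pair is examined once.
    upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(upper_mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],   # exactly 2 * a, so perfectly correlated with a
    "c": [4.0, 1.0, 3.0, 2.0],
})
reduced = drop_correlated(df)
print(list(reduced.columns))
```

Of each correlated pair, the column that appears later in the frame is the one removed.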

auto_drop_duplicates(df: ~pandas.core.frame.DataFrame, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>]

Removes columns that are duplicates of each other

Parameters:
  • df – the Canonical data to drop duplicate columns from

  • inplace – if the passed Canonical should be used or a deep copy

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

if inplace, returns a formatted cleaner contract for this method, else a deep copy Canonical.
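Columns that duplicate each other can be found by transposing the frame and asking pandas for duplicated rows. This is a small sketch of that idea with a hypothetical helper name; how the library itself compares columns is not stated here.

```python
import pandas as pd


def drop_duplicate_columns(df):
    """Keep the first of any set of columns whose values are identical."""
    # After transposing, each original column is a row, so duplicated() flags repeats.
    duplicated = df.T.duplicated().to_numpy()
    return df.loc[:, ~duplicated]


df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [3, 2, 1]})
unique_cols = drop_duplicate_columns(df)
print(list(unique_cols.columns))
```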

auto_projection(df, headers: list = None, drop: bool = None, n_components: [<class 'int'>, <class 'float'>] = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs) DataFrame

Principal component analysis (PCA) is a linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space.

Parameters:
  • df – a pd.DataFrame as the reference dataframe

  • headers – (optional) a list of headers to select (default) or drop from the dataset

  • drop – (optional) if True then drop the headers. False by default

  • n_components – (optional) Number of components to keep.

  • seed – (optional) placeholder

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

  • kwargs – additional parameters to pass the PCA model

Returns:

a pd.DataFrame
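The projection step can be sketched directly with NumPy's SVD. This is only an illustration of PCA by Singular Value Decomposition as described above; it does not call the PCA model the method wraps, pca_project is a hypothetical name, and only an integer n_components is handled.

```python
import numpy as np


def pca_project(data, n_components):
    """Project rows of data onto the first n_components principal directions."""
    matrix = np.asarray(data, dtype=float)
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Four points on the line y = x: one component captures all of the variance.
points = [[0, 0], [1, 1], [2, 2], [3, 3]]
projected = pca_project(points, n_components=1)
print(projected.shape)
```

The sign of each component is arbitrary, so comparisons are best made on absolute values.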

auto_reinstate_nulls(df, nulls_list=None, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]

Automatically reinstates nulls that have been masked with alternate values such as a space or question mark. By default, the nulls list is [‘’, ‘ ’, ‘NaN’, ‘nan’, ‘None’, ‘null’, ‘Null’, ‘NULL’]

Parameters:
  • df – the pandas DataFrame to reinstate nulls in

  • nulls_list – (optional) potential null values to replace with a null.

  • headers – a list of headers to drop or filter on type

  • drop – to drop or not drop the headers

  • dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’

  • exclude – to exclude or include the dtypes

  • regex – a regular expression to search the headers

  • re_ignore_case – true if the regex should ignore case. Default is False

  • inplace – (optional) if the passed pandas.DataFrame should be used or a deep copy

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
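Reinstating masked nulls amounts to replacing every placeholder value with a real null. The sketch below does this with DataFrame.isin and DataFrame.mask; the helper name is hypothetical, the default list is copied from the description above, and column filtering by headers, dtype or regex is not shown.

```python
import pandas as pd

DEFAULT_NULLS = ["", " ", "NaN", "nan", "None", "null", "Null", "NULL"]


def reinstate_nulls(df, nulls_list=None):
    """Return a copy of df where every placeholder in nulls_list becomes a null."""
    nulls = DEFAULT_NULLS if nulls_list is None else nulls_list
    # mask() writes a null wherever the boolean frame is True.
    return df.mask(df.isin(nulls))


df = pd.DataFrame({"city": ["Paris", "", "null"], "score": ["7", "?", " "]})
restored = reinstate_nulls(df, nulls_list=DEFAULT_NULLS + ["?"])
print(restored.isna().sum().to_dict())
```

The question mark is not in the default list, so it has to be supplied through nulls_list as done here.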

auto_remove_null_rows(df, nulls_list: list = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]

automatically removes rows where the full row is null

Parameters:
  • df – the pandas DataFrame to remove null rows from

  • nulls_list – (optional) potential null values to consider other than just np.nan

  • inplace – (optional) if the passed pandas.DataFrame should be used or a deep copy

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
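As a rough illustration, rows that are entirely null, including values listed in nulls_list, can be filtered as follows. The helper name is hypothetical and the library's handling of nulls_list may differ in detail.

```python
import numpy as np
import pandas as pd


def remove_null_rows(df, nulls_list=None):
    """Return the rows of df that contain at least one non-null value."""
    # Treat the extra placeholder values as nulls only for the purpose of the test.
    candidate = df.mask(df.isin(nulls_list)) if nulls_list else df
    keep = candidate.notna().any(axis=1)
    return df.loc[keep]


df = pd.DataFrame({"a": [1.0, np.nan, np.nan], "b": ["x", "", None]})
kept = remove_null_rows(df, nulls_list=[""])
print(list(kept.index))
```

The second row survives a plain null check because of its empty string, and is only removed once the empty string is named in nulls_list.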

auto_to_category(df: ~pandas.core.frame.DataFrame, unique_max: int = None, null_max: float = None, fill_nulls: str = None, nulls_list: list = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]

Auto categorises object dtype columns whose number of unique values does not exceed unique_max and whose proportion of nulls does not exceed null_max.

Parameters:
  • df – the pandas.DataFrame to auto categorise

  • unique_max – the max number of unique values in the column. default to 20

  • null_max – maximum fraction of nulls in the column, between 0 and 1. Defaults to 0.7 (70% nulls allowed)

  • fill_nulls – a value to fill nulls that then can be identified as a category type

  • nulls_list – potential null values to replace.

  • inplace – if the passed pandas.DataFrame should be used or a deep copy

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
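The selection rule can be sketched with pandas as below, assuming the defaults quoted in the parameter list. The helper name is hypothetical, and fill_nulls and nulls_list are not covered.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def auto_categorise(df, unique_max=20, null_max=0.7):
    """Return a copy of df with low-cardinality text columns cast to category."""
    out = df.copy()
    for col in out.columns:
        series = out[col]
        # Only text-like columns are candidates; numeric columns are left alone.
        if not (is_object_dtype(series) or is_string_dtype(series)):
            continue
        if series.nunique() <= unique_max and series.isna().mean() <= null_max:
            out[col] = series.astype("category")
    return out


df = pd.DataFrame({
    "colour": ["red", "blue", "red", "blue"],
    "name": ["ann", "bob", "cat", "dan"],
    "n": [1, 2, 3, 4],
})
typed = auto_categorise(df, unique_max=2)
print(typed.dtypes.astype(str).to_dict())
```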

auto_to_date(df: ~pandas.core.frame.DataFrame, timezone: str = None, day_first: bool = None, year_first: bool = None, date_format: str = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

looks through the dataset for valid date formats and converts them to a common datetime.

Parameters:
  • df – the Pandas.DataFrame to get the column headers from

  • inplace – if the passed pandas.DataFrame should be used or a deep copy

  • timezone – set the timezone else data set to native

  • year_first – specifies whether to parse with the year first. If True, parses dates with the year first, e.g. 10/11/12 is parsed as 2010-11-12. If both day_first and year_first are True, year_first takes precedence (same as dateutil).

  • day_first – specifies whether to parse with the day first. If True, parses dates with the day first, e.g. %d-%m-%Y. If False, defaults to the preferred ordering, normally %m-%d-%Y (but not strict)

  • date_format – if the date can’t be inferred uses date format eg format=’%Y%m%d’

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
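The effect of day_first and date_format can be seen with pandas.to_datetime, whose dayfirst and format arguments are assumed here to be what these parameters map onto. The example is a sketch of the parsing behaviour only, not of the column detection this method performs.

```python
import pandas as pd

ambiguous = pd.Series(["03/04/2021"])

# Without a hint the month is read first: 4 March 2021.
month_first = pd.to_datetime(ambiguous)

# With dayfirst=True the same text is read as 3 April 2021.
day_first = pd.to_datetime(ambiguous, dayfirst=True)

# When the layout cannot be inferred, an explicit format removes the guesswork.
compact = pd.to_datetime(pd.Series(["20210403"]), format="%Y%m%d")

print(month_first.iloc[0].date(), day_first.iloc[0].date(), compact.iloc[0].date())
```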

auto_transition(df, unique_max: int = None, null_max: float = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]

Automatically tries to convert a passed DataFrame to appropriate types

Parameters:
  • df – the pandas DataFrame to convert

  • unique_max – the max number of unique values in the column. default to 20

  • null_max – maximum fraction of nulls in the column, between 0 and 1. Defaults to 0.7 (70% nulls allowed)

  • inplace – if the passed pandas.DataFrame should be used or a deep copy

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.

run_intent_pipeline(canonical: ~pandas.core.frame.DataFrame, intent_levels: [<class 'int'>, <class 'str'>, <class 'list'>] = None, run_book: str = None, **kwargs)

Collectively runs all parameterised intent taken from the property manager against the code base as defined by the intent_contract.

It is expected that all intent methods have the ‘canonical’ as the first parameter of the method signature and will contain ‘inplace’ and ‘save_intent’ as parameters.

Parameters:
  • canonical – this is the iterative value all intent are applied to and returned.

  • intent_levels – (optional) a single level or a list of levels to run. If a list, the levels run in the order given

  • run_book – (optional) a preset runbook of intent_level to run in order

  • kwargs – additional kwargs to add to the parameterised intent, these will replace any that already exist

Returns:

Canonical with parameterised intent applied or None if inplace is True
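The idea of replaying stored, parameterised intent against the canonical can be sketched with a small stand-in class. Everything in this example is hypothetical: the class, its two intent methods and the nested dict used as the intent contract are illustrative only and do not reflect how the property manager actually stores intent.

```python
import pandas as pd


class TinyIntentModel:
    """Hypothetical stand-in showing how stored intent could be replayed in order."""

    def upper_headers(self, canonical):
        return canonical.rename(columns=str.upper)

    def drop_headers(self, canonical, headers):
        return canonical.drop(columns=headers)

    def run_intent_pipeline(self, canonical, intent_contract, intent_levels=None, **kwargs):
        if intent_levels is None:
            levels = list(intent_contract)
        elif isinstance(intent_levels, list):
            levels = intent_levels
        else:
            levels = [intent_levels]
        for level in levels:
            for method_name, params in intent_contract[level].items():
                method = getattr(self, method_name)
                # Extra kwargs replace any stored parameter of the same name.
                canonical = method(canonical, **{**params, **kwargs})
        return canonical


contract = {
    "tidy": {"upper_headers": {}},
    "select": {"drop_headers": {"headers": ["B"]}},
}
df = pd.DataFrame({"a": [1], "b": [2]})
model = TinyIntentModel()
full_run = model.run_intent_pipeline(df, contract, intent_levels=["tidy", "select"])
tidy_only = model.run_intent_pipeline(df, contract, intent_levels="tidy")
print(list(full_run.columns), list(tidy_only.columns))
```

Each method takes the canonical as its first argument and returns the updated canonical, which is what lets the loop chain the levels together.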

to_bool_type(df: ~pandas.core.frame.DataFrame, bool_map: dict = None, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]

converts column to bool based on the map

Parameters:
  • df – the Pandas.DataFrame to get the column headers from

  • bool_map – a mapping of what to make True and False

  • headers – a list of headers to drop or filter on type

  • drop – to drop or not drop the headers

  • dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’

  • exclude – to exclude or include the dtypes

  • regex – a regular expression to search the headers

  • re_ignore_case – true if the regex should ignore case. Default is False

  • inplace – if the passed pandas.DataFrame should be used or a deep copy

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
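A mapping-based conversion to bool can be sketched with Series.map. The shape of bool_map used here (each source value mapped to True or False) is an assumption for illustration, as is the helper name; the format the library expects may differ.

```python
import pandas as pd


def to_bool(df, headers, bool_map):
    """Return a copy of df with the given columns mapped through bool_map."""
    out = df.copy()
    for header in headers:
        # Values missing from bool_map become null rather than True or False.
        out[header] = out[header].map(bool_map)
    return out


df = pd.DataFrame({"active": ["Y", "N", "Y"], "n": [1, 2, 3]})
converted = to_bool(df, headers=["active"], bool_map={"Y": True, "N": False})
print(converted["active"].tolist())
```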

to_category_type(df: ~pandas.core.frame.DataFrame, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, inplace: bool = None, as_num: bool = None, fill_nulls: str = None, nulls_list: list = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]

converts columns to categories

Parameters:
  • df – the Pandas.DataFrame to get the column headers from

  • headers – a list of headers to drop or filter on type

  • drop – to drop or not drop the headers

  • dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’

  • exclude – to exclude or include the dtypes

  • regex – a regular expression to search the headers

  • re_ignore_case – true if the regex should ignore case. Default is False

  • as_num – if true returns the category as a category code

  • fill_nulls – a value to fill nulls that then can be identified as a category type

  • nulls_list – potential null values to replace.

  • inplace – if the passed pandas.DataFrame should be used or a deep copy

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
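The interplay of fill_nulls and as_num can be illustrated with standard pandas calls. It is assumed here, not confirmed by the text above, that as_num corresponds to the integer codes pandas assigns to each category.

```python
import pandas as pd

grades = pd.Series(["b", None, "a"])

# Filling nulls first lets the missing entries form a category of their own.
filled = grades.fillna("missing")
as_category = filled.astype("category")

# Categories are sorted, and each value is replaced by its position in that order.
codes = as_category.cat.codes

print(list(as_category.cat.categories), codes.tolist())
```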

to_date_element(df: ~pandas.core.frame.DataFrame, matrix: [<class 'str'>, <class 'list'>], headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, day_first: bool = None, year_first: bool = None, date_format: str = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>]

Breaks a date down into value representations of the various parts of that date.

Parameters:
  • df – the Pandas.DataFrame to get the column headers from

  • matrix – the matrix options (see below)

  • headers – a list of headers to drop or filter on type

  • drop – to drop or not drop the headers

  • dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’

  • exclude – to exclude or include the dtypes

  • regex – a regular expression to search the headers

  • re_ignore_case – true if the regex should ignore case. Default is False

  • inplace – if the passed pandas.DataFrame should be used or a deep copy

  • year_first – specifies whether to parse with the year first. If True, parses dates with the year first, e.g. 10/11/12 is parsed as 2010-11-12. If both day_first and year_first are True, year_first takes precedence (same as dateutil).

  • day_first – specifies whether to parse with the day first. If True, parses dates with the day first, e.g. %d-%m-%Y. If False, defaults to the preferred ordering, normally %m-%d-%Y (but not strict)

  • date_format – if the date can’t be inferred uses date format eg format=’%Y%m%d’

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.

Matrix options are: - yr: year - dec: decade - mon: month - day: day - dow: day of week - hr: hour - min: minute - woy: week of year - doy: day of year
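Each matrix option has a natural counterpart among the pandas datetime accessors. The mapping below is a sketch under stated assumptions: the decade is taken as the year rounded down to a multiple of ten, the week of year follows the ISO calendar, day of week counts Monday as 0, and the output column naming is invented for the example.

```python
import pandas as pd

ELEMENTS = {
    "yr": lambda s: s.dt.year,
    "dec": lambda s: (s.dt.year // 10) * 10,
    "mon": lambda s: s.dt.month,
    "day": lambda s: s.dt.day,
    "dow": lambda s: s.dt.dayofweek,
    "hr": lambda s: s.dt.hour,
    "min": lambda s: s.dt.minute,
    "woy": lambda s: s.dt.isocalendar().week,
    "doy": lambda s: s.dt.dayofyear,
}


def date_elements(series, matrix):
    """Build one column per requested matrix option from a datetime series."""
    options = [matrix] if isinstance(matrix, str) else list(matrix)
    return pd.DataFrame({f"{series.name}_{opt}": ELEMENTS[opt](series) for opt in options})


when = pd.Series(pd.to_datetime(["2021-03-15 14:30:00"]), name="when")
parts = date_elements(when, ["yr", "dec", "mon", "day", "dow", "hr", "min", "woy", "doy"])
print(parts.iloc[0].to_dict())
```

15 March 2021 was a Monday, which is why the day-of-week element comes out as 0 under this convention.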

to_date_from_excel_type(df: ~pandas.core.frame.DataFrame, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]

converts excel date formats into datetime

Parameters:
  • df – the Pandas.DataFrame to get the column headers from

  • headers – a list of headers to drop or filter on type

  • drop – to drop or not drop the headers

  • dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’

  • exclude – to exclude or include the dtypes

  • regex – a regular expression to search the headers

  • re_ignore_case – true if the regex should ignore case. Default is False

  • inplace – if the passed pandas.DataFrame should be used or a deep copy

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
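Excel stores dates as day counts, and for dates after February 1900 the count can be converted by measuring from 30 December 1899. The sketch below assumes that convention (the Excel 1900 date system) and uses pandas.to_datetime with its unit and origin arguments; the helper name is hypothetical and the library's own conversion may be written differently.

```python
import pandas as pd

# Day zero of the Excel 1900 date system, as commonly used for conversion.
EXCEL_ORIGIN = pd.Timestamp("1899-12-30")


def excel_serial_to_datetime(series):
    """Convert Excel serial day numbers to pandas datetimes."""
    return pd.to_datetime(series, unit="D", origin=EXCEL_ORIGIN)


serials = pd.Series([43831, 44270])
converted_dates = excel_serial_to_datetime(serials)
print(converted_dates.dt.strftime("%Y-%m-%d").tolist())
```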

to_date_type(df: ~pandas.core.frame.DataFrame, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, timezone: str = None, day_first: bool = None, year_first: bool = None, date_format: str = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>]

converts columns to date types

Parameters:
  • df – the Pandas.DataFrame to get the column headers from

  • headers – a list of headers to drop or filter on type

  • drop – to drop or not drop the headers

  • dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’

  • exclude – to exclude or include the dtypes

  • regex – a regular expression to search the headers

  • re_ignore_case – true if the regex should ignore case. Default is False

  • inplace – if the passed pandas.DataFrame should be used or a deep copy

  • timezone – set the timezone else data set to native

  • year_first – specifies whether to parse with the year first. If True, parses dates with the year first, e.g. 10/11/12 is parsed as 2010-11-12. If both day_first and year_first are True, year_first takes precedence (same as dateutil).

  • day_first – specifies whether to parse with the day first. If True, parses dates with the day first, e.g. %d-%m-%Y. If False, defaults to the preferred ordering, normally %m-%d-%Y (but not strict)

  • date_format – if the date can’t be inferred uses date format eg format=’%Y%m%d’

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.

to_float_type(df: ~pandas.core.frame.DataFrame, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, precision=None, fillna=None, errors=None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]

converts columns to float type

Parameters:
  • df – the Pandas.DataFrame to get the column headers from

  • headers – a list of headers to drop or filter on type

  • drop – to drop or not drop the headers

  • dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’

  • exclude – to exclude or include the dtypes

  • regex – a regular expression to search the headers

  • re_ignore_case – true if the regex should ignore case. Default is False

  • precision – how many decimal places to set the return values. if None then the number is unchanged

  • fillna – { num_value, ‘mean’, ‘mode’, ‘median’ }. Default to np.nan - If num_value, then replaces NaN with this number value - If ‘mean’, then replaces NaN with the mean of the column - If ‘mode’, then replaces NaN with a mode of the column. random sample if more than 1 - If ‘median’, then replaces NaN with the median of the column

  • errors – {‘ignore’, ‘raise’, ‘coerce’}, default ‘coerce’ - If ‘raise’, then invalid parsing will raise an exception - If ‘coerce’, then invalid parsing will be set as NaN - If ‘ignore’, then invalid parsing will return the input

  • inplace – if the passed pandas.DataFrame should be used or a deep copy

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
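The interaction of the errors and precision parameters can be sketched in plain Python on a single column held as a list. This is only an illustration of the documented options, not the library's code; the helper name to_float_column is hypothetical and fillna is left out for brevity.

```python
import math


def to_float_column(values, precision=None, errors="coerce"):
    """Sketch of the documented 'errors' and 'precision' options (hypothetical helper)."""
    converted = []
    for value in values:
        try:
            converted.append(float(value))
        except (TypeError, ValueError):
            if errors == "raise":
                raise                      # invalid parsing raises an exception
            if errors == "ignore":
                return list(values)        # invalid parsing returns the input
            converted.append(math.nan)     # 'coerce': invalid parsing is set as NaN
    if precision is not None:
        # Round valid numbers only; NaN is passed through untouched.
        converted = [v if math.isnan(v) else round(v, precision) for v in converted]
    return converted


floats = to_float_column(["1.256", "abc", 3], precision=2)
print(floats)
```

Here "abc" cannot be parsed, so under the default 'coerce' it becomes NaN while the other values are converted and rounded to two decimal places.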

to_int_type(df: ~pandas.core.frame.DataFrame, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, fillna=None, errors=None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]

converts columns to int type

Parameters:
  • df – the pandas.DataFrame to get the column headers from

  • headers – a list of headers to drop or filter on type

  • drop – to drop or not drop the headers

  • dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’

  • exclude – to exclude or include the dtypes

  • regex – a regular expression to search the headers

  • re_ignore_case – true if the regex should ignore case. Default is False

  • fillna – { num_value, ‘mean’, ‘mode’, ‘median’ }. Default to 0 - If num_value, then replaces NaN with this number value - If ‘mean’, then replaces NaN with the mean of the column - If ‘mode’, then replaces NaN with a mode of the column. random sample if more than 1 - If ‘median’, then replaces NaN with the median of the column

  • errors – {‘ignore’, ‘raise’, ‘coerce’}, default ‘coerce’ - If ‘raise’, then invalid parsing will raise an exception - If ‘coerce’, then invalid parsing will be set as NaN - If ‘ignore’, then invalid parsing will return the input

  • inplace – if the passed pandas.DataFrame should be used or a deep copy

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
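One way to read the fillna default of 0 for to_int_type is that NaN has no integer representation, so missing or unparsable values must be filled before the cast. The plain-Python sketch below illustrates that ordering under the 'coerce' behaviour; the helper name to_int_column is hypothetical and the library may order or implement these steps differently.

```python
import math


def to_int_column(values, fillna=0):
    """Sketch: coerce to a number, fill NaN, then cast to int (hypothetical helper)."""
    result = []
    for value in values:
        try:
            number = float(value)
        except (TypeError, ValueError):
            number = math.nan          # 'coerce': invalid parsing is set as NaN
        if math.isnan(number):
            number = fillna            # NaN cannot be cast to int, so fill it first
        result.append(int(number))
    return result


ints = to_int_column(["7", None, "x", 4.0])
print(ints)
```

The None and "x" entries are filled with 0, and the remaining values are cast to 7 and 4.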

to_list_type(df: ~pandas.core.frame.DataFrame, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, inplace: bool = None, fill_nulls: str = None, nulls_list: list = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]

converts a string representation of a list into a list type

Parameters:
  • df – the pandas.DataFrame to get the column headers from

  • headers – a list of headers to drop or filter on type

  • drop – to drop or not drop the headers

  • dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’

  • exclude – to exclude or include the dtypes

  • regex – a regular expression to search the headers

  • re_ignore_case – true if the regex should ignore case. Default is False

  • fill_nulls – a value to fill nulls, default to an empty list

  • nulls_list – potential null values to replace.

  • inplace – if the passed pandas.DataFrame should be used or a deep copy

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
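Converting a string representation of a list into a list can be sketched with the standard library's ast.literal_eval. The helper below is a hypothetical illustration: the default null values it uses, and the choice to wrap unparsable text in a single-item list, are assumptions rather than documented behaviour.

```python
import ast


def to_list_value(value, fill_nulls=None, nulls_list=None):
    """Sketch: parse one cell into a list (hypothetical helper, not the library code)."""
    # Assumption: these stand in for the 'potential null values' when none are given.
    nulls = nulls_list if nulls_list is not None else ["", "nan", "None"]
    if value is None or value in nulls:
        # Nulls default to an empty list unless a fill value is supplied.
        return [] if fill_nulls is None else fill_nulls
    if isinstance(value, list):
        return value
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return [value]                 # assumption: unparsable text becomes a one-item list
    return parsed if isinstance(parsed, list) else [parsed]


lists = [to_list_value(v) for v in ["['a', 'b']", "[1, 2]", "", None]]
print(lists)
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for text read from a data file.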

to_numeric_type(df: ~pandas.core.frame.DataFrame, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, precision=None, fillna=None, errors=None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]

converts columns to numeric type

Parameters:
  • df – the pandas.DataFrame to get the column headers from

  • headers – a list of headers to drop or filter on type

  • drop – to drop or not drop the headers

  • dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’

  • exclude – to exclude or include the dtypes

  • regex – a regular expression to search the headers

  • re_ignore_case – true if the regex should ignore case. Default is False

  • precision – how many decimal places to set the return values. if None then the number is unchanged

  • fillna – { num_value, ‘mean’, ‘mode’, ‘median’ }. Default to np.nan - If num_value, then replaces NaN with this number value. Must be a value not a string - If ‘mean’, then replaces NaN with the mean of the column - If ‘mode’, then replaces NaN with a mode of the column. random sample if more than 1 - If ‘median’, then replaces NaN with the median of the column

  • errors – {‘ignore’, ‘raise’, ‘coerce’}, default ‘coerce’ - If ‘raise’, then invalid parsing will raise an exception - If ‘coerce’, then invalid parsing will be set as NaN - If ‘ignore’, then invalid parsing will return the input

  • inplace – if the passed pandas.DataFrame should be used or a deep copy

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
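The four fillna strategies listed above can be illustrated on a single column held as a plain list. This is a sketch of the described behaviour, not the library's code; the helper name fill_missing and its seed parameter (used to make the random choice between tied modes repeatable) are hypothetical.

```python
import math
import random
import statistics


def _is_nan(value):
    return isinstance(value, float) and math.isnan(value)


def fill_missing(values, fillna, seed=None):
    """Sketch of the documented fillna strategies (hypothetical helper)."""
    valid = [v for v in values if not _is_nan(v)]
    if fillna == "mean":
        fill = statistics.mean(valid)
    elif fillna == "median":
        fill = statistics.median(valid)
    elif fillna == "mode":
        modes = statistics.multimode(valid)        # all values tied for most frequent
        fill = random.Random(seed).choice(modes)   # random sample if more than one
    elif isinstance(fillna, (int, float)):
        fill = fillna                              # a number value, not a string
    else:
        raise ValueError(f"unsupported fillna value: {fillna!r}")
    return [fill if _is_nan(v) else v for v in values]


column = [1.0, math.nan, 2.0, 2.0, 7.0]
print(fill_missing(column, "mean"))
print(fill_missing(column, "median"))
print(fill_missing(column, "mode"))
print(fill_missing(column, 0))
```

For this column the mean of the valid values is 3.0 while the median and the mode are both 2.0, so the strategies give visibly different fills.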

to_remove(df: ~pandas.core.frame.DataFrame, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]

remove columns from the pandas.DataFrame

Parameters:
  • df – the pandas.DataFrame to get the column headers from

  • headers – a list of headers to drop or filter on type

  • drop – to drop or not drop the headers

  • dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’

  • exclude – to exclude or include the dtypes

  • regex – a regular expression to search the headers

  • re_ignore_case – true if the regex should ignore case. Default is False

  • inplace – if the passed pandas.DataFrame should be used or a deep copy

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
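How headers, drop, regex and re_ignore_case might combine to pick the columns to remove is sketched below in plain Python, with the table held as a dict of column name to values. The combination rules are an assumption inferred from the parameter descriptions (headers narrows the candidates, drop inverts that choice, regex then filters by name); the helper names filter_headers and remove_columns are hypothetical and dtype filtering is omitted.

```python
import re


def filter_headers(columns, headers=None, drop=False, regex=None, re_ignore_case=False):
    """Return the column names matched by the selection parameters (assumed semantics)."""
    headers = [headers] if isinstance(headers, str) else list(headers or [])
    if not headers:
        selected = list(columns)       # assumption: no headers means start from all columns
    elif drop:
        selected = [c for c in columns if c not in headers]
    else:
        selected = [c for c in columns if c in headers]
    if regex:
        patterns = [regex] if isinstance(regex, str) else list(regex)
        flags = re.IGNORECASE if re_ignore_case else 0
        selected = [c for c in selected if any(re.search(p, c, flags) for p in patterns)]
    return selected


def remove_columns(table, **selection):
    """Drop every column matched by the selection and keep the rest."""
    targets = set(filter_headers(table.keys(), **selection))
    return {name: values for name, values in table.items() if name not in targets}


table = {"id": [1, 2], "tmp_a": [0, 0], "TMP_b": [1, 1], "name": ["x", "y"]}
print(list(remove_columns(table, regex="^tmp_", re_ignore_case=True)))
print(list(remove_columns(table, headers=["id", "name"], drop=True)))
```

Both calls leave only the id and name columns: the first removes everything matching the pattern, the second removes everything except the listed headers.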

to_sample(df, sample_size: [<class 'int'>, <class 'float'>], shuffle: bool = None, seed: int = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

allows a certain sample size to be selected from the dataframe.

Parameters:
  • df – the pandas.DataFrame to take the sample from

  • sample_size – If float, should be between 0.0 and 1.0 and represent the proportion of the data set to return as a sample. If int, represents the absolute number of samples.

  • shuffle – (optional) if the canonical should be shuffled

  • seed – (optional) a seed value for the random selection, used when shuffle is set

  • inplace – if the passed pandas.DataFrame should be used or a deep copy

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
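The float versus int interpretation of sample_size can be sketched in plain Python over a list of rows. This is an illustration only: the helper name sample_rows is hypothetical, and returning the leading rows when shuffle is not set is an assumption about the unshuffled case.

```python
import random


def sample_rows(rows, sample_size, shuffle=False, seed=None):
    """Sketch of proportional versus absolute sampling (hypothetical helper)."""
    if isinstance(sample_size, float):
        if not 0.0 <= sample_size <= 1.0:
            raise ValueError("a float sample_size must be between 0.0 and 1.0")
        count = int(round(len(rows) * sample_size))    # proportion of the data set
    else:
        count = min(int(sample_size), len(rows))       # absolute number of samples
    if shuffle:
        # A seeded generator makes the shuffled sample repeatable.
        return random.Random(seed).sample(rows, count)
    return rows[:count]                                # assumption: keep the leading rows


rows = list(range(10))
print(sample_rows(rows, 0.3))
print(sample_rows(rows, 4))
print(sample_rows(rows, 4, shuffle=True, seed=42))
```

Passing the same seed twice yields the same shuffled sample, which is useful when a transition needs to be reproducible.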

to_select(df: ~pandas.core.frame.DataFrame, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]

selects columns from the pandas.DataFrame

Parameters:
  • df – the pandas.DataFrame to get the column headers from

  • headers – a list of headers to drop or filter on type

  • drop – to drop or not drop the headers

  • dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’

  • exclude – to exclude or include the dtypes

  • regex – a regular expression to search the headers

  • re_ignore_case – true if the regex should ignore case. Default is False

  • inplace – if the passed pandas.DataFrame should be used or a deep copy

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
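The dtype and exclude pair can be pictured as an include-or-exclude filter on column types. The plain-Python sketch below stands in for that idea using Python value types on a dict of columns; it is not the library's code, the helper name select_by_dtype is hypothetical, and real pandas dtypes (for example the 'number' alias) are richer than the exact type check used here.

```python
def select_by_dtype(table, dtype, exclude=False):
    """Sketch: keep columns whose values all have one of the given types (hypothetical helper)."""
    types = tuple(dtype) if isinstance(dtype, (list, tuple)) else (dtype,)
    selected = {}
    for name, values in table.items():
        # Exact type comparison, so bool columns are not mistaken for int columns.
        matches = all(type(v) in types for v in values)
        if matches != exclude:         # exclude=True inverts the selection
            selected[name] = values
    return selected


table = {
    "id": [1, 2],
    "price": [1.5, 2.0],
    "name": ["a", "b"],
    "active": [True, False],
}
print(list(select_by_dtype(table, [int, float])))
print(list(select_by_dtype(table, [int, float], exclude=True)))
```

The first call keeps the id and price columns; with exclude set, the same dtype list keeps name and active instead.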

to_str_type(df: ~pandas.core.frame.DataFrame, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, fixed_len_pad: str = None, use_string_type: bool = None, fill_nulls: str = None, nulls_list: list = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]

converts columns to str type

Parameters:
  • df – the pandas.DataFrame to get the column headers from

  • headers – a list of headers to drop or filter on type

  • drop – to drop or not drop the headers

  • dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’

  • exclude – to exclude or include the dtypes

  • regex – a regular expression to search the headers

  • re_ignore_case – true if the regex should ignore case. Default is False

  • fixed_len_pad – a padding character that when passed pads all values to the length of the longest

  • use_string_type – if the dtype ‘string’ should be used or keep as object type

  • fill_nulls – a value to fill nulls that then can be identified as a category type

  • nulls_list – potential null values to replace. Can be a boolean or a list: if boolean and True, then nulls_list equals [‘NaN’, ‘nan’, ‘null’, ‘’, ‘None’, np.nan, None]; if a list, its items are treated as the potential null values.

  • inplace – if the passed pandas.DataFrame should be used or a deep copy

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
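The effect of fill_nulls, nulls_list and fixed_len_pad can be sketched on a single column held as a plain list. This is an illustration rather than the library's implementation: the helper name to_str_column is hypothetical, null detection is done on the string form of each value, and padding on the left is an assumption, since the description only says values are padded to the length of the longest.

```python
def to_str_column(values, fixed_len_pad=None, fill_nulls=None, nulls_list=None):
    """Sketch of string conversion with null filling and padding (hypothetical helper)."""
    # String forms of the documented default null values, used when nulls_list is True.
    default_nulls = ["NaN", "nan", "null", "", "None"]
    nulls = default_nulls if nulls_list is True else list(nulls_list or [])
    texts = []
    for value in values:
        text = str(value)
        if fill_nulls is not None and text in nulls:
            text = fill_nulls          # replace recognised nulls with the fill value
        texts.append(text)
    if fixed_len_pad:
        width = max((len(t) for t in texts), default=0)
        # Assumption: pad on the left so every value reaches the longest length.
        texts = [t.rjust(width, fixed_len_pad) for t in texts]
    return texts


print(to_str_column([7, 42, 123], fixed_len_pad="0"))
print(to_str_column(["a", None, "nan"], fill_nulls="unknown", nulls_list=True))
```

The first call produces fixed-width codes of three characters, and the second replaces both None and the text "nan" with the fill value.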