IntentTransition Class
- class ds_discovery.intent.transition_intent.TransitionIntentModel(property_manager: ~ds_discovery.managers.transition_property_manager.TransitionPropertyManager, default_save_intent: bool = None, default_intent_level: [<class 'str'>, <class 'int'>, <class 'float'>] = None, order_next_available: bool = None, default_replace_intent: bool = None)
This component provides a set of actions that focus on tidying raw data by removing columns that are not useful to the final feature set, a step also known as data selection. Such columns may include null columns, single-value columns, duplicate columns and noise. The component can also ensure the data is properly canonicalized by enforcing data typing.
- auto_clean_header(df, case=None, rename_map: [<class 'dict'>, <class 'str'>] = None, replace_spaces: str = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
Cleans the headers of a pandas DataFrame, replacing spaces with underscores by default. The rename_map can be passed either as a dict or as the name of a connector contract that holds the column mapping.
- Parameters:
df – the pandas.DataFrame whose headers are to be cleaned
rename_map – a dict of name value pairs or connector name for column mapping
case – changes the headers to lower, upper, title or snake case. If none of these is given, the case is left unchanged
replace_spaces – character to replace spaces with. Default is ‘_’ (underscore)
inplace – if the passed pandas.DataFrame should be used or a deep copy
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
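The header cleaning described above can be pictured with plain pandas. This is a minimal sketch, not the library's implementation: the helper name clean_header is hypothetical, stripping surrounding whitespace is an assumption, and snake case is left out for brevity.

```python
import pandas as pd

def clean_header(df, case=None, rename_map=None, replace_spaces="_"):
    """Return a copy of df with tidied column names (illustrative helper)."""
    out = df.copy()
    if rename_map:
        # explicit renames are applied first
        out = out.rename(columns=rename_map)
    # assumption: trim surrounding whitespace, then swap inner spaces
    cols = [str(c).strip().replace(" ", replace_spaces) for c in out.columns]
    if case in ("lower", "upper", "title"):
        cols = [getattr(c, case)() for c in cols]
    out.columns = cols
    return out

df = pd.DataFrame({"First Name": ["a"], " Age ": [1]})
cleaned = clean_header(df, case="lower")
print(list(cleaned.columns))  # ['first_name', 'age']
```

The original frame is left untouched, which mirrors the deep copy behaviour described for inplace.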
- auto_drop_columns(df, null_min: float = None, predominant_max: float = None, nulls_list: [<class 'bool'>, <class 'list'>] = None, drop_predominant: bool = None, drop_empty_row: bool = None, drop_unknown: bool = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]
Automatically removes columns that are at least 99.8% np.NaN (a fraction of 0.998), hold a single value, have a standard deviation of zero, or have a predominant value whose share exceeds the default of 0.998 (99.8%).
- Parameters:
df – the pandas.DataFrame to auto remove columns from
null_min – the minimum fraction of null values for a column to be dropped. Defaults to 0.998 (99.8% nulls)
predominant_max – the maximum fraction of a column that a single value may occupy. Defaults to 0.998 (99.8%)
nulls_list – can be a boolean or a list: if boolean and True then nulls_list equals [‘NaN’, ‘nan’, ‘null’, ‘’, ‘None’, ‘ ‘]; if a list then its entries are treated as potential null values.
drop_predominant – drop columns that have a predominant value of the given predominant max
drop_empty_row – also drop any rows where all the values are empty
drop_unknown – (optional) drop objects that are not string types such as binary
inplace – whether to change the passed pandas.DataFrame or return a copy (see Returns)
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
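To make the null_min and predominant_max thresholds concrete, the following is a rough pandas sketch of the selection logic. The helper drop_uninformative is hypothetical and only approximates the behaviour described above; it ignores nulls_list, drop_empty_row and drop_unknown.

```python
import numpy as np
import pandas as pd

def drop_uninformative(df, null_min=0.998, predominant_max=0.998):
    """Drop columns that are almost all null, constant, or dominated by one value."""
    to_drop = []
    for col in df.columns:
        series = df[col]
        # fraction of nulls in the column
        if series.isna().mean() >= null_min:
            to_drop.append(col)
            continue
        # share of each non-null value, most frequent first
        shares = series.value_counts(normalize=True, dropna=True)
        if shares.empty or len(shares) == 1 or shares.iloc[0] > predominant_max:
            to_drop.append(col)
    return df.drop(columns=to_drop)

df = pd.DataFrame({
    "id": range(1000),
    "all_null": [np.nan] * 1000,
    "constant": ["x"] * 1000,
    "mostly_a": ["a"] * 999 + ["b"],
})
result = drop_uninformative(df)
print(list(result.columns))  # ['id']
```

Here mostly_a is dropped because its top value covers 99.9% of rows, which is above the 0.998 default.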
Uses ‘brute force’ techniques to remove highly correlated numeric columns based on the threshold, set by default to 0.998.
- Parameters:
df – the Canonical data to drop correlated columns from
threshold – (optional) threshold correlation between columns. default 0.998
inplace – if the passed Canonical should be used or a deep copy
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
if inplace, returns a formatted cleaner contract for this method, else a deep copy Canonical.
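One common way to drop highly correlated numeric columns is to scan the upper triangle of the absolute correlation matrix. The sketch below uses that approach with plain pandas and numpy; it is an assumption that this resembles the ‘brute force’ technique mentioned above, and drop_correlated is a hypothetical name.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.998):
    """Drop the later column of every numeric pair whose |correlation| exceeds threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    # keep only the upper triangle so each pair is inspected once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],   # exact multiple of "a"
    "c": [4.0, 1.0, 3.0, 2.0],
})
result = drop_correlated(df)
print(list(result.columns))  # ['a', 'c']
```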
- auto_drop_duplicates(df: ~pandas.core.frame.DataFrame, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>]
Removes columns that are duplicates of each other
- Parameters:
df – the Canonical data to drop duplicates from
inplace – if the passed Canonical should be used or a deep copy
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
if inplace, returns a formatted cleaner contract for this method, else a deep copy Canonical.
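Duplicate columns, meaning columns whose values are identical row for row, can be found in pandas by transposing the frame. This is a small sketch of the idea with a hypothetical helper name, not necessarily how the method is implemented.

```python
import pandas as pd

def drop_duplicate_columns(df):
    """Keep the first of any group of columns that hold identical values."""
    # after transposing, duplicated() compares whole columns as rows
    is_duplicate = df.T.duplicated()
    return df.loc[:, ~is_duplicate.to_numpy()]

df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": ["x", "y", "z"]})
result = drop_duplicate_columns(df)
print(list(result.columns))  # ['a', 'c']
```

Transposing is simple but copies the data, so on very wide or very long frames a column-by-column comparison may be preferable.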
- auto_projection(df, headers: list = None, drop: bool = None, n_components: [<class 'int'>, <class 'float'>] = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs) DataFrame
Principal component analysis (PCA) is a linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space.
- Parameters:
df – a pd.DataFrame as the reference dataframe
headers – (optional) a list of headers to select (default) or drop from the dataset
drop – (optional) if True then drop the headers. False by default
n_components – (optional) Number of components to keep.
seed – (optional) placeholder
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
kwargs – additional parameters to pass the PCA model
- Returns:
a pd.DataFrame
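The projection itself can be illustrated directly with numpy's SVD. The sketch below is only a stand-in for whichever PCA model the method wraps: the helper pca_project and the output column names pca_0, pca_1 are hypothetical.

```python
import numpy as np
import pandas as pd

def pca_project(df, n_components=2):
    """Project numeric columns onto their first n_components principal axes."""
    X = df.to_numpy(dtype=float)
    X_centered = X - X.mean(axis=0)          # PCA requires centred data
    # rows of Vt are the principal directions, ordered by singular value
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    projected = X_centered @ Vt[:n_components].T
    columns = [f"pca_{i}" for i in range(n_components)]
    return pd.DataFrame(projected, columns=columns, index=df.index)

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=list("abcd"))
projected = pca_project(df, n_components=2)
print(projected.shape)  # (50, 2)
```

The first component carries at least as much variance as the second, and the two are uncorrelated by construction.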
- auto_reinstate_nulls(df, nulls_list=None, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]
automatically reinstates nulls that have been masked with alternate values such as space or question-mark. By default, the nulls list is [‘’,’ ‘,’NaN’,’nan’,’None’,’null’,’Null’,’NULL’]
- Parameters:
df – the pandas DataFrame in which to reinstate nulls
nulls_list – (optional) potential null values to replace with a null.
headers – a list of headers to drop or filter on type
drop – to drop or not drop the headers
dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’
exclude – to exclude or include the dtypes
regex – a regular expression to search the headers
re_ignore_case – true if the regex should ignore case. Default is False
inplace – (optional) if the passed pandas.DataFrame should be used or a deep copy
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
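As an illustration of reinstating masked nulls, the default list quoted above can be fed to pandas' own replace. This is a sketch of the idea only; adding a question mark to the list is an assumption made for the example.

```python
import numpy as np
import pandas as pd

# the default nulls list quoted in the description above
NULLS = ["", " ", "NaN", "nan", "None", "null", "Null", "NULL"]

df = pd.DataFrame({
    "city": ["Paris", " ", "null"],
    "score": ["1", "?", "3"],
})
# treat a question mark as a masked null as well (example-specific choice)
restored = df.replace(NULLS + ["?"], np.nan)
print(restored.isna().sum().to_dict())  # {'city': 2, 'score': 1}
```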
- auto_remove_null_rows(df, nulls_list: list = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]
automatically removes rows where the full row is null
- Parameters:
df – the pandas DataFrame to remove null rows from
nulls_list – (optional) potential null values to consider other than just np.nan
inplace – (optional) if the passed pandas.DataFrame should be used or a deep copy
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
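A plain pandas sketch of removing fully null rows, where extra values count as null alongside np.nan, might look like the following. Whether the method returns the surviving rows with their original values, as here, is an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": ["1", "", "3"], "b": ["x", "null", None]},
    dtype=object,
)
extra_nulls = ["", "null"]  # values to consider null in addition to np.nan

# mark cells that are null once the extra values are taken into account
is_null = df.replace(extra_nulls, np.nan).isna()
# keep rows that have at least one non-null cell
kept = df.loc[~is_null.all(axis=1)]
print(list(kept.index))  # [0, 2]
```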
- auto_to_category(df: ~pandas.core.frame.DataFrame, unique_max: int = None, null_max: float = None, fill_nulls: str = None, nulls_list: list = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]
Auto categorises object dtype columns whose number of unique values does not exceed unique_max and whose proportion of nulls does not exceed null_max.
- Parameters:
df – the pandas.DataFrame to auto categorise
unique_max – the max number of unique values in the column. default to 20
null_max – maximum fraction of nulls in the column, between 0 and 1. Defaults to 0.7 (70% nulls allowed)
fill_nulls – a value to fill nulls that then can be identified as a category type
nulls_list – potential null values to replace.
inplace – if the passed pandas.DataFrame should be used or a deep copy
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
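The unique_max and null_max rules can be sketched in a few lines of pandas. The helper below is hypothetical and leaves out fill_nulls and nulls_list; it only shows how the two thresholds select columns.

```python
import pandas as pd

def to_category(df, unique_max=20, null_max=0.7):
    """Convert low-cardinality object columns to the category dtype."""
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        few_values = out[col].nunique() <= unique_max
        few_nulls = out[col].isna().mean() <= null_max
        if few_values and few_nulls:
            out[col] = out[col].astype("category")
    return out

df = pd.DataFrame({
    "colour": pd.Series(["red", "blue", "green"] * 10, dtype=object),
    "token": pd.Series([f"t{i}" for i in range(30)], dtype=object),
})
result = to_category(df)
print(result.dtypes.astype(str).to_dict())  # {'colour': 'category', 'token': 'object'}
```

The token column has 30 unique values, above the default unique_max of 20, so it is left as it was.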
- auto_to_date(df: ~pandas.core.frame.DataFrame, timezone: str = None, day_first: bool = None, year_first: bool = None, date_format: str = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
looks through the dataset for valid date formats and converts them to a common datetime.
- Parameters:
df – the Pandas.DataFrame to get the column headers from
inplace – if the passed pandas.DataFrame should be used or a deep copy
timezone – the timezone to set; if omitted, the dates are left timezone-naive
year_first – specifies whether to parse with the year first. If True, parses dates with the year first, e.g. 10/11/12 is parsed as 2010-11-12. If both day_first and year_first are True, year_first takes precedence (same as dateutil).
day_first – specifies whether to parse with the day first. If True, parses dates with the day first, e.g. %d-%m-%Y. If False, defaults to a preferred format, normally %m-%d-%Y (but not strict)
date_format – if the date can’t be inferred, uses this date format, e.g. format=’%Y%m%d’
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
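The parsing options above closely follow those of pandas.to_datetime, so their effect can be shown with that function directly. It is an assumption that the method delegates to it; the sample values are illustrative.

```python
import pandas as pd

# day-first strings: 31/12/2021 can only be day/month/year
raw = pd.Series(["31/12/2021", "01/02/2022"])
parsed = pd.to_datetime(raw, dayfirst=True)

# a compact form that cannot be inferred reliably needs an explicit format
compact = pd.to_datetime(pd.Series(["20211231"]), format="%Y%m%d")

# attach a timezone to otherwise naive datetimes
localized = parsed.dt.tz_localize("UTC")

print(parsed.iloc[1])  # 2022-02-01 00:00:00
```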
- auto_transition(df, unique_max: int = None, null_max: float = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]
automatically tries to convert a passed DataFrame to appropriate types
- Parameters:
df – the pandas DataFrame to convert
unique_max – the max number of unique values in the column. default to 20
null_max – maximum fraction of nulls in the column, between 0 and 1. Defaults to 0.7 (70% nulls allowed)
inplace – if the passed pandas.DataFrame should be used or a deep copy
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
- run_intent_pipeline(canonical: ~pandas.core.frame.DataFrame, intent_levels: [<class 'int'>, <class 'str'>, <class 'list'>] = None, run_book: str = None, **kwargs)
Collectively runs all parameterised intent taken from the property manager against the code base as defined by the intent_contract.
It is expected that all intent methods have the ‘canonical’ as the first parameter of the method signature and will contain ‘inplace’ and ‘save_intent’ as parameters.
- Parameters:
canonical – this is the iterative value all intent are applied to and returned.
intent_levels – (optional) a single level or a list of levels to run; if a list, the levels run in the order given
run_book – (optional) a preset runbook of intent_level to run in order
kwargs – additional kwargs to add to the parameterised intent, these will replace any that already exist
- Returns:
Canonical with parameterised intent applied or None if inplace is True
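The idea of replaying stored, parameterised intent against a canonical can be sketched in plain Python. Everything in this example is hypothetical: the contract layout, the registry INTENT_METHODS and the two toy intent functions are stand-ins, not the structures used by the property manager.

```python
import pandas as pd

def drop_null_rows(canonical, **kwargs):
    return canonical.dropna(how="all")

def lower_headers(canonical, **kwargs):
    return canonical.rename(columns=str.lower)

# hypothetical registry of intent methods, looked up by name
INTENT_METHODS = {"drop_null_rows": drop_null_rows, "lower_headers": lower_headers}

# hypothetical stored contract: level -> ordered list of (method name, params)
contract = {
    "clean": [("drop_null_rows", {})],
    "format": [("lower_headers", {})],
}

def run_pipeline(canonical, contract, intent_levels=None, **kwargs):
    """Apply each stored intent in level order, feeding the result forward."""
    levels = list(contract) if intent_levels is None else intent_levels
    if isinstance(levels, (str, int)):
        levels = [levels]
    for level in levels:
        for method_name, params in contract.get(level, []):
            merged = {**params, **kwargs}  # passed kwargs replace stored ones
            canonical = INTENT_METHODS[method_name](canonical, **merged)
    return canonical

df = pd.DataFrame({"A": [1, None], "B": [2, None]})
result = run_pipeline(df, contract)
only_format = run_pipeline(df, contract, intent_levels="format")
print(result.shape, list(result.columns))  # (1, 2) ['a', 'b']
```

Because every intent takes the canonical as its first argument and returns it, the loop can chain any number of methods without knowing what each one does.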
- to_bool_type(df: ~pandas.core.frame.DataFrame, bool_map: dict = None, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]
converts column to bool based on the map
- Parameters:
df – the Pandas.DataFrame to get the column headers from
bool_map – a mapping of what to make True and False
headers – a list of headers to drop or filter on type
drop – to drop or not drop the headers
dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’
exclude – to exclude or include the dtypes
regex – a regular expression to search the headers
re_ignore_case – true if the regex should ignore case. Default is False
inplace – if the passed pandas.DataFrame should be used or a deep copy
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
- to_category_type(df: ~pandas.core.frame.DataFrame, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, inplace: bool = None, as_num: bool = None, fill_nulls: str = None, nulls_list: list = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]
converts columns to categories
- Parameters:
df – the Pandas.DataFrame to get the column headers from
headers – a list of headers to drop or filter on type
drop – to drop or not drop the headers
dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’
exclude – to exclude or include the dtypes
regex – a regular expression to search the headers
re_ignore_case – true if the regex should ignore case. Default is False
as_num – if true returns the category as a category code
fill_nulls – a value to fill nulls that then can be identified as a category type
nulls_list – potential null values to replace.
inplace – if the passed pandas.DataFrame should be used or a deep copy
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
- to_date_element(df: ~pandas.core.frame.DataFrame, matrix: [<class 'str'>, <class 'list'>], headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, day_first: bool = None, year_first: bool = None, date_format: str = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>]
breaks a date down into value representations of the various parts of that date.
- Parameters:
df – the Pandas.DataFrame to get the column headers from
matrix – the matrix options (see below)
headers – a list of headers to drop or filter on type
drop – to drop or not drop the headers
dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’
exclude – to exclude or include the dtypes
regex – a regular expression to search the headers
re_ignore_case – true if the regex should ignore case. Default is False
inplace – if the passed pandas.DataFrame should be used or a deep copy
year_first – specifies whether to parse with the year first. If True, parses dates with the year first, e.g. 10/11/12 is parsed as 2010-11-12. If both day_first and year_first are True, year_first takes precedence (same as dateutil).
day_first – specifies whether to parse with the day first. If True, parses dates with the day first, e.g. %d-%m-%Y. If False, defaults to a preferred format, normally %m-%d-%Y (but not strict)
date_format – if the date can’t be inferred, uses this date format, e.g. format=’%Y%m%d’
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
Matrix options are: - yr: year - dec: decade - mon: month - day: day - dow: day of week - hr: hour - min: minute - woy: week of year - doy: day of year
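Each matrix option corresponds to a component that pandas exposes through the .dt accessor. The sketch below builds all of them for two sample dates. The conventions are assumptions: decade as the year rounded down to a multiple of ten, day of week with Monday as 0, and ISO week numbering may differ from what the method produces.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2021-03-15 14:30", "1999-12-31 23:59"]))

elements = pd.DataFrame({
    "yr": dates.dt.year,
    "dec": dates.dt.year // 10 * 10,       # assumption: 2021 -> 2020
    "mon": dates.dt.month,
    "day": dates.dt.day,
    "dow": dates.dt.dayofweek,             # pandas convention: Monday is 0
    "hr": dates.dt.hour,
    "min": dates.dt.minute,
    "woy": dates.dt.isocalendar().week,    # ISO 8601 week number
    "doy": dates.dt.dayofyear,
})
print(elements.loc[0].to_dict())
```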
- to_date_from_excel_type(df: ~pandas.core.frame.DataFrame, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]
converts excel date formats into datetime
- Parameters:
df – the Pandas.DataFrame to get the column headers from
headers – a list of headers to drop or filter on type
drop – to drop or not drop the headers
dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’
exclude – to exclude or include the dtypes
regex – a regular expression to search the headers
re_ignore_case – true if the regex should ignore case. Default is False
inplace – if the passed pandas.DataFrame should be used or a deep copy
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
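Excel stores dates as serial day counts, and a common pandas idiom converts them by counting days from 1899-12-30. The sketch below shows that idiom; it is an assumption that the method uses the same origin.

```python
import pandas as pd

# Excel serial numbers: whole part is the day, fraction is the time of day
serials = pd.Series([44197, 44197.5])

# 1899-12-30 as origin compensates for Excel's 1900 leap-year quirk
converted = pd.to_datetime(serials, unit="D", origin="1899-12-30")
print(converted.iloc[0])  # 2021-01-01 00:00:00
```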
- to_date_type(df: ~pandas.core.frame.DataFrame, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, timezone: str = None, day_first: bool = None, year_first: bool = None, date_format: str = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>]
converts columns to date types
- Parameters:
df – the Pandas.DataFrame to get the column headers from
headers – a list of headers to drop or filter on type
drop – to drop or not drop the headers
dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’
exclude – to exclude or include the dtypes
regex – a regular expression to search the headers
re_ignore_case – true if the regex should ignore case. Default is False
inplace – if the passed pandas.DataFrame should be used or a deep copy
timezone – the timezone to set; if omitted, the dates are left timezone-naive
year_first – specifies whether to parse with the year first. If True, parses dates with the year first, e.g. 10/11/12 is parsed as 2010-11-12. If both day_first and year_first are True, year_first takes precedence (same as dateutil).
day_first – specifies whether to parse with the day first. If True, parses dates with the day first, e.g. %d-%m-%Y. If False, defaults to a preferred format, normally %m-%d-%Y (but not strict)
date_format – if the date can’t be inferred, uses this date format, e.g. format=’%Y%m%d’
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
- to_float_type(df: ~pandas.core.frame.DataFrame, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, precision=None, fillna=None, errors=None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]
converts columns to float type
- Parameters:
df – the Pandas.DataFrame to get the column headers from
headers – a list of headers to drop or filter on type
drop – to drop or not drop the headers
dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’
exclude – to exclude or include the dtypes
regex – a regular expression to search the headers
re_ignore_case – true if the regex should ignore case. Default is False
precision – how many decimal places to set the return values. if None then the number is unchanged
fillna – { num_value, ‘mean’, ‘mode’, ‘median’ }. Default to np.nan - If num_value, then replaces NaN with this number value - If ‘mean’, then replaces NaN with the mean of the column - If ‘mode’, then replaces NaN with a mode of the column. random sample if more than 1 - If ‘median’, then replaces NaN with the median of the column
errors – {‘ignore’, ‘raise’, ‘coerce’}, default ‘coerce’ - If ‘raise’, then invalid parsing will raise an exception - If ‘coerce’, then invalid parsing will be set as NaN - If ‘ignore’, then invalid parsing will return the input
inplace – if the passed pandas.DataFrame should be used or a deep copy
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
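The interaction of errors, fillna and precision can be sketched in plain Python. This is only an illustration of the documented semantics for a single column held as a list, not the library's implementation; the helper name to_float_column is hypothetical.

```python
import math
import statistics


def to_float_column(values, fillna=None, precision=None):
    """Hypothetical helper: coerce a list of values to float.

    Mirrors errors='coerce': anything that cannot be parsed becomes NaN.
    """
    coerced = []
    for value in values:
        try:
            coerced.append(float(value))
        except (TypeError, ValueError):
            coerced.append(math.nan)

    if fillna is not None:
        valid = [v for v in coerced if not math.isnan(v)]
        if fillna == 'mean':
            fill = statistics.mean(valid)
        elif fillna == 'median':
            fill = statistics.median(valid)
        elif fillna == 'mode':
            # statistics.mode returns the first mode encountered; the
            # documented method samples at random when there are several.
            fill = statistics.mode(valid)
        else:
            fill = float(fillna)  # a plain numeric fill value
        coerced = [fill if math.isnan(v) else v for v in coerced]

    if precision is not None:
        coerced = [round(v, precision) for v in coerced]
    return coerced


# 'abc' and None cannot be parsed, so they become NaN and are then
# replaced by the mean of the valid values (1.5 and 2.0).
result = to_float_column(['1.5', 'abc', 2, None], fillna='mean', precision=2)
print(result)  # [1.5, 1.75, 2.0, 1.75]
```

With fillna left as None the unparseable entries stay as NaN, which matches the documented default of np.nan.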
- to_int_type(df: ~pandas.core.frame.DataFrame, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, fillna=None, errors=None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]
converts columns to int type
- Parameters:
df – the pandas.DataFrame to get the column headers from
headers – a list of headers to drop or filter on type
drop – to drop or not drop the headers
dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’
exclude – to exclude or include the dtypes
regex – a regular expression to search the headers
re_ignore_case – true if the regex should ignore case. Default is False
fillna – { num_value, ‘mean’, ‘mode’, ‘median’ }. Default to 0 - If num_value, then replaces NaN with this number value - If ‘mean’, then replaces NaN with the mean of the column - If ‘mode’, then replaces NaN with a mode of the column. random sample if more than 1 - If ‘median’, then replaces NaN with the median of the column
errors – {‘ignore’, ‘raise’, ‘coerce’}, default ‘coerce’ - If ‘raise’, then invalid parsing will raise an exception - If ‘coerce’, then invalid parsing will be set as NaN - If ‘ignore’, then invalid parsing will return the input
inplace – if the passed pandas.DataFrame should be used or a deep copy
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
- to_list_type(df: ~pandas.core.frame.DataFrame, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, inplace: bool = None, fill_nulls: str = None, nulls_list: list = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]
converts a string representation of a list into a list type
- Parameters:
df – the pandas.DataFrame to get the column headers from
headers – a list of headers to drop or filter on type
drop – to drop or not drop the headers
dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’
exclude – to exclude or include the dtypes
regex – a regular expression to search the headers
re_ignore_case – true if the regex should ignore case. Default is False
fill_nulls – a value to fill nulls, default to an empty list
nulls_list – potential null values to replace.
inplace – if the passed pandas.DataFrame should be used or a deep copy
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
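Converting a string representation of a list back into a list can be sketched with the standard library's ast.literal_eval. This is an assumed approach for illustration; the source does not say how the library parses the strings, and the helper name and default null markers below are hypothetical.

```python
import ast


def to_list_value(value, nulls_list=None):
    """Hypothetical helper: turn a string such as "[1, 2]" into a list."""
    # Assumed default null markers, for illustration only.
    nulls = nulls_list if nulls_list is not None else ['', 'nan', 'None']
    if value is None:
        return []
    if not isinstance(value, str):
        # Already a list: keep it; any other object is wrapped in a list.
        return value if isinstance(value, list) else [value]
    if value.strip() in nulls:
        return []  # nulls default to an empty list
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return [value]  # not a Python literal: keep the raw string
    if isinstance(parsed, (list, tuple, set)):
        return list(parsed)
    return [parsed]


column = ["[1, 2, 3]", "['a', 'b']", "", "None", "plain text"]
converted = [to_list_value(v) for v in column]
print(converted)  # [[1, 2, 3], ['a', 'b'], [], [], ['plain text']]
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for untrusted column content.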
- to_numeric_type(df: ~pandas.core.frame.DataFrame, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, precision=None, fillna=None, errors=None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]
converts columns to numeric type
- Parameters:
df – the pandas.DataFrame to get the column headers from
headers – a list of headers to drop or filter on type
drop – to drop or not drop the headers
dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’
exclude – to exclude or include the dtypes
regex – a regular expression to search the headers
re_ignore_case – true if the regex should ignore case. Default is False
precision – how many decimal places to set the return values. if None then the number is unchanged
fillna – { num_value, ‘mean’, ‘mode’, ‘median’ }. Default to np.nan - If num_value, then replaces NaN with this number value. Must be a value not a string - If ‘mean’, then replaces NaN with the mean of the column - If ‘mode’, then replaces NaN with a mode of the column. random sample if more than 1 - If ‘median’, then replaces NaN with the median of the column
errors – {‘ignore’, ‘raise’, ‘coerce’}, default ‘coerce’ - If ‘raise’, then invalid parsing will raise an exception - If ‘coerce’, then invalid parsing will be set as NaN - If ‘ignore’, then invalid parsing will return the input
inplace – if the passed pandas.DataFrame should be used or a deep copy
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
- to_remove(df: ~pandas.core.frame.DataFrame, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]
remove columns from the pandas.DataFrame
- Parameters:
df – the pandas.DataFrame to get the column headers from
headers – a list of headers to drop or filter on type
drop – to drop or not drop the headers
dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’
exclude – to exclude or include the dtypes
regex – a regular expression to search the headers
re_ignore_case – true if the regex should ignore case. Default is False
inplace – if the passed pandas.DataFrame should be used or a deep copy
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
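The headers, drop, regex and re_ignore_case parameters together decide which columns a method acts on. The plain-Python sketch below shows one plausible reading of that selection logic. It assumes that drop inverts the header list and that regex then narrows the result; the helper name filter_headers is used for illustration only, and the dtype and exclude filters are left out.

```python
import re


def filter_headers(columns, headers=None, drop=False, regex=None,
                   re_ignore_case=False):
    """Hypothetical helper: return the column names selected by the filters."""
    headers = [headers] if isinstance(headers, str) else list(headers or [])
    if headers:
        if drop:
            # Assumption: drop=True selects everything except the headers.
            selected = [c for c in columns if c not in headers]
        else:
            selected = [c for c in columns if c in headers]
    else:
        selected = list(columns)

    if regex:
        patterns = [regex] if isinstance(regex, str) else list(regex)
        flags = re.IGNORECASE if re_ignore_case else 0
        selected = [c for c in selected
                    if any(re.search(p, c, flags) for p in patterns)]
    return selected


columns = ['id', 'age', 'Age_band', 'name', 'notes']

# Columns matched by a case-insensitive regex ...
matched = filter_headers(columns, regex='^age', re_ignore_case=True)
# ... and what is left once those columns are removed.
kept = [c for c in columns if c not in matched]
# drop=True inverts an explicit header list.
inverse = filter_headers(columns, headers=['id', 'notes'], drop=True)

print(matched)  # ['age', 'Age_band']
print(kept)     # ['id', 'name', 'notes']
print(inverse)  # ['age', 'Age_band', 'name']
```

Under this reading, a removal drops the selected columns, while a selection keeps only those columns.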
- to_sample(df, sample_size: [<class 'int'>, <class 'float'>], shuffle: bool = None, seed: int = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
allows a certain sample size to be selected from the dataframe.
- Parameters:
df – the pandas.DataFrame to take the sample from
sample_size – If float, should be between 0.0 and 1.0 and represent the proportion of the data set to return as a sample. If int, represents the absolute number of samples.
shuffle – (optional) if the canonical should be shuffled
seed – (optional) a seed value for the shuffle, used if shuffle is not None
inplace – if the passed pandas.DataFrame should be used or a deep copy
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
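How sample_size is interpreted as a float or an int can be sketched in plain Python over a list of rows. The helper name sample_rows is hypothetical, and returning the leading rows when shuffle is off is an assumption rather than documented behaviour.

```python
import random


def sample_rows(rows, sample_size, shuffle=False, seed=None):
    """Hypothetical helper: pick a sample from a list of rows."""
    if isinstance(sample_size, float):
        if not 0.0 <= sample_size <= 1.0:
            raise ValueError("a float sample_size must be between 0.0 and 1.0")
        # A float is a proportion of the data set.
        count = int(round(len(rows) * sample_size))
    else:
        # An int is an absolute number of rows, capped at the data set size.
        count = min(sample_size, len(rows))

    if shuffle:
        # A fixed seed makes the shuffled sample reproducible.
        return random.Random(seed).sample(rows, count)
    # Assumption: without shuffle the leading rows are returned.
    return rows[:count]


rows = list(range(10))
by_fraction = sample_rows(rows, 0.3)
by_count = sample_rows(rows, 4)
shuffled = sample_rows(rows, 4, shuffle=True, seed=31)

print(by_fraction)  # [0, 1, 2]
print(by_count)     # [0, 1, 2, 3]
```

Calling sample_rows again with the same seed returns the same shuffled sample.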
- to_select(df: ~pandas.core.frame.DataFrame, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]
selects columns from the pandas.DataFrame
- Parameters:
df – the pandas.DataFrame to get the column headers from
headers – a list of headers to drop or filter on type
drop – to drop or not drop the headers
dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’
exclude – to exclude or include the dtypes
regex – a regular expression to search the headers
re_ignore_case – true if the regex should ignore case. Default is False
inplace – if the passed pandas.DataFrame should be used or a deep copy
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
- to_str_type(df: ~pandas.core.frame.DataFrame, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, fixed_len_pad: str = None, use_string_type: bool = None, fill_nulls: str = None, nulls_list: list = None, inplace: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) [<class 'dict'>, <class 'pandas.core.frame.DataFrame'>, None]
converts columns to str type
- Parameters:
df – the pandas.DataFrame to get the column headers from
headers – a list of headers to drop or filter on type
drop – to drop or not drop the headers
dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’
exclude – to exclude or include the dtypes
regex – a regular expression to search the headers
re_ignore_case – true if the regex should ignore case. Default is False
fixed_len_pad – a padding character that when passed pads all values to the length of the longest
use_string_type – if the dtype ‘string’ should be used or keep as object type
fill_nulls – a value to fill nulls that then can be identified as a category type
nulls_list – potential null values to replace. Can be a boolean or a list: if boolean and True, then nulls_list equals [‘NaN’, ‘nan’, ‘null’, ‘’, ‘None’, np.nan, None]; if a list, then its items are considered the potential null values.
inplace – if the passed pandas.DataFrame should be used or a deep copy
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
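The effect of fill_nulls, nulls_list and fixed_len_pad can be sketched in plain Python for a single column. The helper name to_str_column is hypothetical. Padding on the left is an assumption, since the source only says values are padded to the length of the longest.

```python
def to_str_column(values, fixed_len_pad=None, fill_nulls=None,
                  nulls_list=None):
    """Hypothetical helper: convert a list of values to strings."""
    # String forms of the null markers listed for nulls_list.
    nulls = (nulls_list if nulls_list is not None
             else ['NaN', 'nan', 'null', '', 'None'])
    converted = []
    for value in values:
        text = str(value)  # None becomes 'None', float NaN becomes 'nan'
        if fill_nulls is not None and text in nulls:
            text = fill_nulls
        converted.append(text)

    if fixed_len_pad and converted:
        # Pad every value to the length of the longest one.
        # Assumption: padding is added on the left; the pad must be one character.
        width = max(len(text) for text in converted)
        converted = [text.rjust(width, fixed_len_pad) for text in converted]
    return converted


codes = to_str_column([7, 42, None, 1234], fixed_len_pad='0', fill_nulls='0')
print(codes)  # ['0007', '0042', '0000', '1234']
```

Fixed-length padding of this kind is useful for identifiers such as zip codes, where leading zeros are lost if the column was read as a number.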