IntentWrangle Class
- class ds_discovery.intent.wrangle_intent.WrangleIntentModel(property_manager: [<class 'ds_discovery.managers.wrangle_property_manager.WranglePropertyManager'>, <class 'ds_discovery.managers.synthetic_property_manager.SyntheticPropertyManager'>], default_save_intent: bool = None, default_intent_level: [<class 'str'>, <class 'int'>, <class 'float'>] = None, order_next_available: bool = None, default_replace_intent: bool = None)
This component provides a set of actions that focus on data and feature engineering. The class contains a number of transformers to engineer features for use in machine learning models or statistical analysis.
- correlate_activation(canonical: ~typing.Any, header: str, activation: str = None, precision: int = None, seed: int = None, rtn_type: str = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
Activation functions play a crucial role in the backpropagation algorithm, which is the primary algorithm used for training neural networks. During backpropagation, the error of the output is propagated backwards through the network, and the weights of the network are updated based on this error. The activation function is used to introduce non-linearity into the output of a neural network layer.
Logistic Sigmoid, the inverse of the logit function, maps any input value to a value between 0 and 1, making it useful for binary classification problems, and is defined as f(x) = 1/(1+exp(-x))
Tangent Hyperbolic (tanh) function is a shifted and stretched version of the Sigmoid function that maps the input values to a range between -1 and 1, and is defined as f(x) = (exp(x)-exp(-x))/(exp(x)+exp(-x))
Rectified Linear Unit (ReLU) function is a widely used activation function that replaces negative values with zero and keeps the positive values unchanged, and is defined as f(x) = x * (x > 0)
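The three definitions above can be written directly with NumPy. This is a minimal sketch of the formulas only; the function names are illustrative and it is not a claim about how correlate_activation is implemented internally.

```python
import numpy as np


def sigmoid(x):
    # f(x) = 1 / (1 + exp(-x)), output lies between 0 and 1
    return 1.0 / (1.0 + np.exp(-x))


def tanh(x):
    # f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x)), output lies between -1 and 1
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))


def relu(x):
    # f(x) = x * (x > 0), negatives become zero, positives are unchanged
    return x * (x > 0)


values = np.array([-2.0, 0.0, 2.0])
sig_values = sigmoid(values)
tanh_values = tanh(values)
relu_values = relu(values)
```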
- Parameters:
canonical – a pd.DataFrame as the reference dataframe
header – the header in the DataFrame to correlate
activation – (optional) the name of the activation function. Options ‘sigmoid’, ‘tanh’ and ‘relu’
precision – (optional) how many decimal places. default to 3
seed – (optional) the random seed. defaults to current datetime
rtn_type – (optional) changes the default return of a ‘list’ to a pd.Series other than the int, float, category, string and object, passing ‘as-is’ will return as is
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method exists at the level, or default level: True replaces the current intent method with the new; False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
an equal length list of correlated values
- correlate_aggregate(canonical: ~typing.Any, headers: list, agg: str, seed: int = None, rtn_type: str = None, save_intent: bool = None, precision: int = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
Correlates two or more columns with each other through a finite set of aggregation functions. The aggregation function names are limited to ‘sum’, ‘prod’, ‘count’, ‘min’, ‘max’ and ‘mean’ for numeric columns, plus a special ‘list’ function name that combines the columns as a list.
- Parameters:
canonical – a direct or generated pd.DataFrame. see context notes below
headers – a list of headers to correlate
agg – the name of the aggregation function to enact. The available functions are: ‘sum’, ‘prod’, ‘count’, ‘min’, ‘max’, ‘mean’ and ‘list’, which combines the columns as a list
precision – the value precision of the return values
seed – (optional) a seed value for the random function: default to None
rtn_type – (optional) changes the default return of a ‘list’ to a pd.Series other than the int, float, category, string and object, passing ‘as-is’ will return as is
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method exists at the level, or default level: True replaces the current intent method with the new; False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a list of equal length to the one passed
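As an illustration of what a row-wise aggregation across several headers looks like, here is a minimal pandas sketch. The DataFrame and column names are made up, and the calls shown are plain pandas equivalents rather than the library's own code.

```python
import pandas as pd

# hypothetical canonical with two numeric columns
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})
headers = ["A", "B"]

# 'sum' and 'mean' aggregate across the chosen headers, one value per row
row_sum = df[headers].sum(axis=1).tolist()
row_mean = df[headers].mean(axis=1).round(3).tolist()

# 'list' combines the column values of each row into a list
row_list = df[headers].values.tolist()
```

Each result has the same length as the canonical, matching the "equal length" return described above.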
- correlate_categories(canonical: ~typing.Any, header: str, correlations: list, actions: dict, default_action: [<class 'str'>, <class 'int'>, <class 'float'>, <class 'dict'>] = None, rtn_type: str = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
Correlates a set of values to an action; the correlations must map to the index values of the actions dictionary. Note: to use the current value in the passed values as a parameter value, pass an empty dict {} as the key's value. Likewise, if you want the action value to be the current value of the passed value, pass an empty dict as the action.
simple correlation list:
['A', 'B', 'C'] # if values is 'A' then action is 0 and so on
multiple choice correlation
[['A','B'], 'C'] # if values is 'A' OR 'B' then action is 0 and so on
actions dictionary where the method is a class method followed by its parameters
{0: {'method': 'get_numbers', 'from_value': 0, 'to_value': 27}}
you can also use the action to specify a specific value:
{0: 'F', 1: {'method': 'get_numbers', 'from_value': 0, 'to_value': 27}}
- Parameters:
canonical – a direct or generated pd.DataFrame. see context notes below
header – the header in the DataFrame to correlate
correlations – a list of categories (can also contain lists for multiple correlations)
actions – the correlated set of categories that should map to the index
default_action – (optional) a default action to take if the selection is not fulfilled
seed – a seed value for the random function: default to None
rtn_type – (optional) changes the default return of a ‘list’ to a pd.Series other than the int, float, category, string and object, passing ‘as-is’ will return as is
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method exists at the level, or default level: True replaces the current intent method with the new; False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a list of equal length to the one passed
Actions are the resulting outcome of the selection (or the default). An action can be just a value or a dict that executes an intent method such as get_number(). To help build actions there is a helper function called action2dict(…) that takes a method as a mandatory attribute.
- With actions there are special keyword ‘method’ values:
@header: use a column as the value reference, expects the ‘header’ key
@constant: use a value constant, expects the key ‘value’
@sample: use to get sample values, expected ‘name’ of the Sample method, optional ‘shuffle’ boolean
@eval: evaluate a code string, expects the key ‘code_str’ and any locals() required
An example of a simple action to return a selection from a list:
{'method': 'get_category', 'selection': ['M', 'F', 'U']}
an example of using the helper method, in this example we use the keyword @header to get a value from another column at the same index position:
inst.action2dict(method="@header", header='value')
We can even execute some sort of evaluation at run time:
inst.action2dict(method="@eval", code_str='sum(values)', values=[1,4,2,1])
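To make the mapping between correlations and actions concrete, the following plain-Python sketch resolves each value to the action at the matching index. It only handles constant-value actions; the `resolve` helper and the sample data are hypothetical, and dict actions that call an intent method are deliberately left out.

```python
# multiple choice correlation: 'A' OR 'B' map to index 0, 'C' maps to index 1
correlations = [["A", "B"], "C"]
# constant-value actions keyed by the correlation index
actions = {0: "F", 1: "M"}
default_action = "U"


def resolve(value):
    """Return the action whose index matches the correlation containing value."""
    for idx, corr in enumerate(correlations):
        members = corr if isinstance(corr, list) else [corr]
        if value in members:
            return actions.get(idx, default_action)
    # nothing matched, so fall back to the default action
    return default_action


result = [resolve(v) for v in ["A", "C", "B", "Z"]]
```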
- correlate_custom(canonical: ~typing.Any, code_str: str, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs)
Commonly used for custom list comprehension, takes a code string that when evaluated returns a list of values. Before using this method, consider the method correlate_selection(…)
When referencing the canonical in the code_str it should be referenced either by the parameter label ‘canonical’ or by the shortcut ‘@’ symbol. For example:
code_str = "[x + 2 for x in @['A']]" # where 'A' is a header in the canonical
kwargs can also be passed into the code string but must be preceded by a ‘$’ symbol for example:
code_str = "[True if x == $v1 else False for x in @['A']]" # where 'v1' is a kwargs
- Parameters:
canonical – a pd.DataFrame as the reference dataframe
code_str – an action on those column values. to reference the canonical use ‘@’
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method exists at the level, or default level: True replaces the current intent method with the new; False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
kwargs – a set of kwargs to include in any executable function
- Returns:
value set based on the selection list and the action
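One way to picture the ‘@’ and ‘$’ conventions is as simple text substitution before evaluation. The sketch below is an assumption about the general idea, not the library's implementation: `run_code_str` is a hypothetical helper, and `eval` should only ever be used on trusted code strings.

```python
import pandas as pd


def run_code_str(canonical, code_str, **kwargs):
    """Hypothetical helper: expand '@' and '$name', then evaluate the string."""
    # '@' is shorthand for the canonical DataFrame
    expr = code_str.replace("@", "canonical")
    # '$name' refers to a keyword argument passed alongside the code string
    for name in kwargs:
        expr = expr.replace(f"${name}", f"kwargs['{name}']")
    # names are supplied as globals so they are visible inside comprehensions
    return eval(expr, {"canonical": canonical, "kwargs": kwargs})


df = pd.DataFrame({"A": [1, 2, 3]})
plus_two = run_code_str(df, "[x + 2 for x in @['A']]")
is_v1 = run_code_str(df, "[True if x == $v1 else False for x in @['A']]", v1=2)
```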
- correlate_date_diff(canonical: ~typing.Any, first_date: str, second_date: str, units: str = None, precision: int = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs)
Returns a column for the difference between a primary and secondary date, where the primary is an earlier date than the secondary.
- Parameters:
canonical – the DataFrame containing the column headers
first_date – the primary or older date field
second_date – the secondary or newer date field
units – (optional) The Timedelta units e.g. ‘D’, ‘W’, ‘M’, ‘Y’. default is ‘D’
precision – the precision of the result
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method exists at the level, or default level: True replaces the current intent method with the new; False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
kwargs – a set of kwargs to include in any executable function
- Returns:
value set based on the selection list and the action
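A date difference expressed in a chosen unit can be sketched with pandas Timedelta arithmetic. The column names below are invented for the example, and only the ‘D’ and ‘W’ units are shown; how the method itself handles ‘M’ and ‘Y’ is not covered here.

```python
import pandas as pd

# hypothetical canonical with an older and a newer date column
df = pd.DataFrame({
    "start": pd.to_datetime(["2023-01-01", "2023-01-10"]),
    "end": pd.to_datetime(["2023-01-08", "2023-01-31"]),
})

# second_date minus first_date gives a Timedelta per row
delta = df["end"] - df["start"]

# dividing by a unit-sized Timedelta converts the difference to that unit
days = (delta / pd.Timedelta(days=1)).round(1).tolist()
weeks = (delta / pd.Timedelta(weeks=1)).round(1).tolist()
```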
- correlate_dates(canonical: ~typing.Any, header: str, choice: [<class 'int'>, <class 'float'>, <class 'str'>] = None, choice_header: str = None, offset: [<class 'int'>, <class 'dict'>, <class 'str'>] = None, jitter: [<class 'int'>, <class 'str'>] = None, jitter_units: str = None, ignore_time: bool = None, ignore_seconds: bool = None, min_date: str = None, max_date: str = None, now_delta: str = None, date_format: str = None, day_first: bool = None, year_first: bool = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
correlate a list of continuous dates adjusting those dates, or a subset of those dates, with a normalised jitter along with a value offset.
choice, jitter and offset can accept environment variable string names starting with ${ and ending with }.
When using offset and a dict is passed, the dict should take the form {‘days’: 1}, where the unit is plural, to add 1 day, or a singular name {‘hour’: 3}, where the unit is singular, to replace the current hour with 3. Offsets can be ‘years’, ‘months’, ‘weeks’, ‘days’, ‘hours’, ‘minutes’ or ‘seconds’. If an int is passed, days are assumed.
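The plural versus singular distinction matches the behaviour of pandas DateOffset, where plural keywords add to a timestamp and singular keywords replace a component. The sketch below shows that pandas behaviour; whether correlate_dates uses DateOffset internally is an assumption.

```python
import pandas as pd

ts = pd.Timestamp("2023-03-15 10:30:00")

# plural unit: one day is added to the timestamp
added = ts + pd.DateOffset(days=1)

# singular unit: the hour component is replaced with 3
replaced = ts + pd.DateOffset(hour=3)
```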
- Parameters:
canonical – a pd.DataFrame as the reference dataframe
header – the header in the DataFrame to correlate
choice – (optional) The number of values or percentage between 0 and 1 to choose.
choice_header – (optional) those not chosen are given the values of the given header
offset – (optional) Temporal parameter that adds to or replaces the offset value. If int then ‘days’ is assumed
jitter – (optional) the random jitter or deviation in days
jitter_units – (optional) the units of the jitter, Options: ‘W’, ‘D’, ‘h’, ‘m’, ‘s’. default ‘D’
ignore_time – ignore time elements and only select from Year, Month, Day elements. Default is False
ignore_seconds – ignore second elements and only select from Year to minute elements. Default is False
min_date – (optional) a minimum date not to go below
max_date – (optional) a max date not to go above
now_delta – (optional) returns a delta from now as an int list, Options: ‘Y’, ‘M’, ‘W’, ‘D’, ‘h’, ‘m’, ‘s’
day_first – (optional) if the dates given are in day first format. Default to True
year_first – (optional) if the dates given are year first. Default to False
date_format – (optional) the format of the output
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method exists at the level, or default level: True replaces the current intent method with the new; False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a list of equal size to that given
- correlate_discrete_intervals(canonical: ~typing.Any, header: str, granularity: [<class 'int'>, <class 'float'>, <class 'list'>] = None, lower: [<class 'int'>, <class 'float'>] = None, upper: [<class 'int'>, <class 'float'>] = None, categories: list = None, precision: int = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
converts continuous representation into discrete representation through interval categorisation
- Parameters:
canonical – a pd.DataFrame as the reference dataframe
header – the header in the DataFrame to correlate
granularity – (optional) the granularity of the analysis across the range. Default is 3. int passed: represents the number of periods; float passed: the length of each interval; list[tuple]: specific interval periods e.g []; list[float]: the percentiles or quantiles, all of which should fall between 0 and 1
lower – (optional) the lower limit of the number value. Default min()
upper – (optional) the upper limit of the number value. Default max()
precision – (optional) The precision of the range and boundary values. by default set to 5.
categories – (optional) a set of labels the same length as the intervals to name the categories
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method exists at the level, or default level: True replaces the current intent method with the new; False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a list of equal length to the one passed
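Interval categorisation of a continuous column can be illustrated with pandas `cut` and `qcut`. This sketch covers the int (number of periods) and list[float] (quantiles) forms of granularity only; the sample values and labels are invented, and it is not a statement of what the method calls internally.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 5.0, 6.0, 9.0, 10.0])

# int granularity: three equal-width periods across the min-max range,
# with category labels supplied to name each interval
by_count = pd.cut(values, bins=3, labels=["low", "mid", "high"]).tolist()

# list[float] granularity: quantile boundaries, all between 0 and 1
by_quantile = pd.qcut(values, q=[0, 0.5, 1.0], labels=["lower", "upper"]).tolist()
```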
- correlate_join(canonical: ~typing.Any, header: str, action: [<class 'str'>, <class 'dict'>], sep: str = None, rtn_type: str = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
Correlates a column and joins it with the result of the action. This allows composite values to be built. An example might be to take a forename and add the surname with a space separator to create a composite name field, or to join two primary keys to create a single composite key.
- Parameters:
canonical – a direct or generated pd.DataFrame. see context notes below
header – an ordered list of columns to join
action – (optional) a string or a single action whose outcome will be joined to the header value
sep – (optional) a separator between the values
seed – (optional) a seed value for the random function: default to None
rtn_type – (optional) changes the default return of a ‘list’ to a pd.Series - other than the int, float, category, string and object, passing ‘as-is’ will return as is
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method exists at the level, or default level: True replaces the current intent method with the new; False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a list of equal length to the one passed
Actions are the resulting outcome of the selection (or the default). An action can be just a value or a dict that executes an intent method such as get_number(). To help build actions there is a helper function called action2dict(…) that takes a method as a mandatory attribute.
- With actions there are special keyword ‘method’ values:
@header: use a column as the value reference, expects the ‘header’ key
@constant: use a value constant, expects the key ‘value’
@sample: use to get sample values, expected ‘name’ of the Sample method, optional ‘shuffle’ boolean
@eval: evaluate a code string, expects the key ‘code_str’ and any locals() required
An example of a simple action to return a selection from a list:
{'method': 'get_category', 'selection': ['M', 'F', 'U']}
an example of using the helper method, in this example we use the keyword @header to get a value from another column at the same index position:
inst.action2dict(method="@header", header='value')
We can even execute some sort of evaluation at run time:
inst.action2dict(method="@eval", code_str='sum(values)', values=[1,4,2,1])
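The forename and surname example from the description of correlate_join can be pictured as a string join of two columns. This is a plain pandas sketch with made-up data; in the library the second value would come from an action such as the @header keyword.

```python
import pandas as pd

# hypothetical canonical holding the two parts of a name
df = pd.DataFrame({"forename": ["Ada", "Alan"], "surname": ["Lovelace", "Turing"]})
sep = " "

# join the header value with the value from the other column, row by row
full_name = (df["forename"].astype(str) + sep + df["surname"].astype(str)).tolist()
```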
- correlate_list_element(canonical: ~typing.Any, header: str, list_size: int = None, random_choice: bool = None, replace: bool = None, shuffle: bool = None, convert_str: bool = None, rtn_type: str = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
Correlates a column where each element of the column contains a list, and a choice is taken from that list. If list_size == 1 then a single value is correlated, otherwise a list is correlated.
Null values are passed through but all other elements must be a list with at least 1 value in.
if ‘random’ is true then all returned values will be a random selection from the list and of equal length. if ‘random’ is false then each list will not exceed the ‘list_size’
Also if ‘random’ is true and ‘replace’ is False then all lists must have more elements than the list_size. By default ‘replace’ is True and ‘shuffle’ is False.
In addition, ‘convert_str’ allows lists that have been formatted as a string to be converted from a string to a list using ‘ast.literal_eval(x)’
- Parameters:
canonical – a direct or generated pd.DataFrame. see context notes below
header – The header containing a list to choose from.
list_size – (optional) the number of elements to return, if more than 1 then list
random_choice – (optional) if the choice should be a random choice.
replace – (optional) if the choice selection should be replaced or selected only once
shuffle – (optional) if the final list should be shuffled
convert_str – if the header has the list as a string convert to list using ast.literal_eval()
seed – (optional) a seed value for the random function: default to None
rtn_type – (optional) changes the default return of a ‘list’ to a pd.Series other than the int, float, category, string and object, passing ‘as-is’ will return as is
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method exists at the level, or default level: True replaces the current intent method with the new; False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a list of equal length to the one passed
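The behaviour described for string-formatted lists, null pass-through and random choice can be sketched with the standard library. The `pick` helper and sample column are hypothetical, and only random selection with replacement is shown.

```python
import ast
import random

# hypothetical column: lists stored as strings, with one null value
column = ["['a', 'b', 'c']", None, "['x']"]
rng = random.Random(0)  # seeded for repeatable choices


def pick(cell, list_size=1):
    """Choose list_size elements from the list held in a cell."""
    if cell is None:
        # null values are passed through untouched
        return None
    # convert a string-formatted list into a real list
    items = ast.literal_eval(cell) if isinstance(cell, str) else cell
    chosen = rng.choices(items, k=list_size)  # selection with replacement
    # a single value when list_size is 1, otherwise a list
    return chosen[0] if list_size == 1 else chosen


single = [pick(c) for c in column]
pairs = [pick(c, list_size=2) for c in column]
```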
- correlate_mark_outliers(canonical: ~typing.Any, header: str, measure: [<class 'int'>, <class 'float'>] = None, method: str = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
Drops rows in the canonical where the values are deemed outliers based on the method and measure. There are three selectable methods of choice, interquartile or empirical, of which interquartile is the default.
The ‘empirical’ rule states that for a normal distribution, nearly all of the data will fall within three standard deviations of the mean. Given mu and sigma, a simple way to identify outliers is to compute a z-score for every value, which is defined as the number of standard deviations a value is from the mean. Therefore the measure given should be the z-score. The 68–95–99.7 rule gives the percentage of values that lie within a band around the mean in a normal distribution with a width of two, four and six standard deviations, respectively, and thus guides the choice of z-score.
The ‘interquartile’ range (IQR), also called the midspread, middle 50%, or H‑spread, is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles, or between upper and lower quartiles of a sample set. The IQR can be used to identify outliers by defining limits on the sample values that are a factor k of the IQR below the 25th percentile or above the 75th percentile. The common value for the factor k is 1.5. A factor k of 3 or more can be used to identify values that are extreme outliers.
- Parameters:
canonical – a pd.DataFrame as the reference dataframe
header – the header in the DataFrame to correlate
method – (optional) A method to run to identify outliers. interquartile (default) or empirical
measure – (optional) A measure against each method, respectively factor k, z-score, quartile (see above)
seed – (optional) the random seed
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method exists at the level, or default level: True replaces the current intent method with the new; False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
an equal length list of correlated values
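Both outlier rules described above reduce to a few lines of pandas. In this sketch the 0/1 marker encoding, the sample data and the use of the sample standard deviation are assumptions made for illustration.

```python
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 11, 50])

# interquartile: limits sit a factor k of the IQR beyond the quartiles
k = 1.5
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
outside_iqr = (values < q1 - k * iqr) | (values > q3 + k * iqr)
iqr_mark = outside_iqr.astype(int).tolist()

# empirical: mark values whose absolute z-score exceeds the measure
z = 2.0
z_scores = (values - values.mean()) / values.std()
z_mark = (z_scores.abs() > z).astype(int).tolist()
```

With this data both rules flag only the final value, 50.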
- correlate_missing(canonical: ~typing.Any, header: str, method: str = None, weights: str = None, constant: ~typing.Any = None, precision: int = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
imputes missing data with statistical estimates of the missing values. The methods are ‘mean’, ‘median’, ‘mode’ and ‘random’ with the addition of ‘constant’ and ‘indicator’
Mean/median imputation consists of replacing all occurrences of missing values (NA) within a variable by the mean (if the variable has a Gaussian distribution) or median (if the variable has a skewed distribution). Can only be applied to numeric values.
Mode imputation consists of replacing all occurrences of missing values (NA) within a variable by the mode, which is the most frequent value or most frequent category. Can be applied to both numerical and categorical variables.
Random sampling imputation is in principle similar to mean, median, and mode imputation in that it considers that missing values, should look like those already existing in the distribution. Random sampling consists of taking random observations from the pool of available data and using them to replace the NA. In random sample imputation, we take as many random observations as missing values exist in the variable. Can be applied to both numerical and categorical variables.
Neighbour imputation is for filling in missing values using the k-Nearest Neighbors approach. Each missing feature is imputed using values from five nearest neighbors that have a value for the feature. The feature of the neighbors are averaged uniformly or weighted by distance to each neighbor. If a sample has more than one feature missing, then the neighbors for that sample can be different depending on the particular feature being imputed. When the number of available neighbors is less than five the average for that feature is used during imputation. If there is at least one neighbor with a defined distance, the weighted or unweighted average of the remaining neighbors will be used during imputation.
Constant or Arbitrary value imputation consists of replacing all occurrences of missing values (NA) with an arbitrary constant value. Can be applied to both numerical and categorical variables. A value must be passed in the constant parameter relevant to the column type.
Indicator is not an imputation method, but imputation techniques such as mean, median and random can affect the variable distribution quite dramatically, so it is a good idea to flag the missing values with a missing indicator. This must be done before imputation of the column.
- Parameters:
canonical – a pd.DataFrame as the reference dataframe
header – the header in the DataFrame to correlate
method – (optional) the replacement method, ‘mean’, ‘median’, ‘mode’, ‘constant’, ‘random’, ‘indicator’
weights – (optional) Weight function used in prediction of nearest neighbour if used as method. Options ‘uniform’ : uniform weights. All points in each neighborhood are weighted equally. ‘distance’ : weight points by the inverse of their distance.
constant – (optional) a value to use when the method is constant
precision – (optional) if numeric, the precision of the outcome, by default set to 3.
seed – (optional) the random seed. defaults to current datetime
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method exists at the level, or default level: True replaces the current intent method with the new; False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
an equal length list of correlated values
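The statistical imputation methods described above map onto short pandas expressions. The sketch below uses an invented numeric column and plain pandas and NumPy calls; it illustrates the techniques rather than the method's internals, and omits the neighbour approach.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 2.0, 10.0])

# indicator: flag the missing positions before any imputation
indicator = s.isna().astype(int).tolist()

# mean, median, mode and constant replace every NA with a single value
mean_filled = s.fillna(s.mean()).tolist()
median_filled = s.fillna(s.median()).tolist()
mode_filled = s.fillna(s.mode().iloc[0]).tolist()
constant_filled = s.fillna(0.0).tolist()

# random: draw as many observations from the available data as there are NAs
rng = np.random.default_rng(42)
pool = s.dropna().to_numpy()
random_filled = s.copy()
random_filled[s.isna()] = rng.choice(pool, size=int(s.isna().sum()))
```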
- correlate_numbers(canonical: ~typing.Any, header: str, standardize: bool = None, normalize: bool = None, scalar: tuple = None, transform: str = None, precision: int = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
Allows for the scaling transformation of a continuous value set. These techniques are used to alter the values of a variable so that they are expressed on a common scale. This is often done to make it easier to compare different variables or to make it easier to analyze data.
- Parameters:
canonical – a pd.DataFrame as the reference dataframe
header – the header in the DataFrame to correlate
standardize – (optional) standardise continuous variables with mean 0 and std 1
normalize – (optional) normalize continuous variables between 0 and 1.
scalar – (optional) scales continuous variables between a min and max value passed in the tuple pair.
transform – (optional) attempts normal distribution of continuous variables. options are log, sqrt, cbrt, boxcox, yeojohnson
precision – (optional) how many decimal places. default to 3
seed – (optional) the random seed. defaults to current datetime
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
an equal length list of correlated values
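The scaling options for correlate_numbers correspond to standard formulas. The sketch below shows them in plain NumPy, assuming a population standard deviation for standardize and a (min, max) tuple for scalar; how the library computes these internally is not stated here, so treat this as an illustration of the arithmetic only.

```python
import numpy as np

values = np.array([2.0, 4.0, 6.0, 8.0])

# standardize: mean 0 and standard deviation 1 (population std assumed)
standardized = (values - values.mean()) / values.std()

# normalize: rescale into the range 0 to 1
normalized = (values - values.min()) / (values.max() - values.min())

# scalar: rescale into an arbitrary (min, max) pair
low, high = 10.0, 20.0
scaled = normalized * (high - low) + low

# transform: one of the listed options, here 'log'
log_transformed = np.log(values)
```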
The offset can be a numeric offset that is added to the value, e.g. passing 2 will add 2 to all values. If a string is passed, its format should be a calculation with the ‘@’ character used to represent the column value. e.g.
'1-@' would subtract the column value from 1, '@*0.5' would multiply the column value by 0.5
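To make the offset convention concrete, here is a small sketch of how a numeric or ‘@’-string offset could be applied. The helper apply_offset is hypothetical and written only for illustration; it is not part of the library, and the use of eval is an assumption about one possible way to interpret the string.

```python
import numpy as np

def apply_offset(values, offset):
    """Hypothetical helper: apply a numeric or '@'-string offset to values."""
    values = np.asarray(values, dtype=float)
    if isinstance(offset, str):
        # '@' stands for the column value, so substitute a variable name
        expression = offset.replace('@', 'x')
        # evaluate with no builtins and only 'x' available
        return eval(expression, {'__builtins__': {}}, {'x': values})
    # a numeric offset is simply added to every value
    return values + offset

added = apply_offset([1, 2, 3], 2)
subtracted = apply_offset([1, 2, 3], '1-@')
halved = apply_offset([1, 2, 3], '@*0.5')
```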
- correlate_polynomial(canonical: ~typing.Any, header: str, coefficient: list, rtn_type: str = None, seed: int = None, keep_zero: bool = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
creates a polynomial using the reference header values and applies the coefficients, where the index of the list represents the degree of the term in reverse order.
e.g. [6, -2, 0, 4] => f(x) = 4x**3 - 2x + 6
- Parameters:
canonical – a direct or generated pd.DataFrame. see context notes below
header – the header in the DataFrame to correlate
coefficient – the reverse list of term coefficients
seed – (optional) the random seed. defaults to current datetime
rtn_type – (optional) changes the default return of a ‘list’ to a pd.Series other than the int, float, category, string and object, passing ‘as-is’ will return as is
keep_zero – (optional) if True then zeros passed remain zero, Default is False
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
an equal length list of correlated values
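The coefficient ordering can be checked with NumPy, whose numpy.polynomial.polynomial.polyval also takes coefficients indexed by degree. This sketch only evaluates the example polynomial; it does not call correlate_polynomial and makes no claim about the library's handling of rtn_type or keep_zero.

```python
import numpy as np
from numpy.polynomial import polynomial as P

# index 0 is the constant term, index 3 the cubic term:
# [6, -2, 0, 4] => f(x) = 4x**3 - 2x + 6
coefficient = [6, -2, 0, 4]
x = np.array([0.0, 1.0, 2.0])

result = P.polyval(x, coefficient)
```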
- correlate_selection(canonical: ~typing.Any, selection: list, action: [<class 'str'>, <class 'int'>, <class 'float'>, <class 'dict'>], default_action: [<class 'str'>, <class 'int'>, <class 'float'>, <class 'dict'>] = None, seed: int = None, rtn_type: str = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
returns a value set based on the selection list and the action enacted on that selection. If the selection criteria is not fulfilled then the default_action is taken if specified, else null value.
If a DataFrame is not passed, the values column is referenced by the header ‘_default’
- Parameters:
canonical – a direct or generated pd.DataFrame. see context notes below
selection – a list of selections where conditions are filtered on, executed in list order
An example of a selection with the minimum requirements is: (see ‘select2dict(…)’)
[{'column': 'genre', 'condition': "=='Comedy'"}]
action – a value or dict to act upon if the select is successful. see below for more examples
An example of an action as a dict: (see ‘action2dict(…)’)
{'method': 'get_category', 'selection': ['M', 'F', 'U']}
default_action – (optional) a default action to take if the selection is not fulfilled
seed – (optional) a seed value for the random function: default to None
rtn_type – (optional) changes the default return of a ‘list’ to a pd.Series - other than the int, float, category, string and object, passing ‘as-is’ will return as is
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
value set based on the selection list and the action
Selections are a list of dictionaries of conditions and optional additional parameters to filter. To help build conditions there is a static helper method called ‘select2dict(…)’ that has parameter options available to build a condition. An example of a condition with the minimum requirements is [{'column': 'genre', 'condition': "=='Comedy'"}]
An example of using the helper method:
selection = [inst.select2dict(column='gender', condition="=='M'"), inst.select2dict(column='age', condition=">65", logic='XOR')]
Using the ‘select2dict’ method ensures the correct keys are used and the dictionary is properly formed. It also helps with building the logic that is executed in order.
Actions are the resulting outcome of the selection (or the default). An action can be just a value or a dict that executes an intent method such as get_number(). To help build actions there is a helper function called action2dict(…) that takes a method as a mandatory attribute.
- With actions there are special keyword ‘method’ values:
@header: use a column as the value reference, expects the ‘header’ key
@constant: use a value constant, expects the key ‘value’
@sample: use to get sample values, expects the ‘name’ of the Sample method, optional ‘shuffle’ boolean
@eval: evaluate a code string, expects the key ‘code_str’ and any locals() required
An example of a simple action to return a selection from a list:
{'method': 'get_category', 'selection': ['M', 'F', 'U']}
This same action using the helper method would look like:
inst.action2dict(method='get_category', selection=['M', 'F', 'U'])
An example of using the helper method, where the keyword @header is used to get a value from another column at the same index position:
inst.action2dict(method="@header", header='value')
We can even execute some sort of evaluation at run time:
inst.action2dict(method="@eval", code_str='sum(values)', values=[1,4,2,1])
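To illustrate how a selection list might resolve to a row mask, the sketch below evaluates each condition in list order with plain pandas and combines it with the previous result using the ‘logic’ key. The helper selection_mask, the default logic of ‘AND’ and the ‘AND’/‘OR’ values are assumptions made for illustration; only ‘XOR’ appears in the example above, and this is not the library's implementation.

```python
import pandas as pd

def selection_mask(df, selection):
    """Hypothetical helper: combine selection conditions in list order."""
    mask = None
    for item in selection:
        # e.g. column 'age' and condition '>65' become the expression 'age>65'
        current = df.eval(f"{item['column']}{item['condition']}", engine='python')
        logic = item.get('logic', 'AND')  # default is an assumption
        if mask is None:
            mask = current
        elif logic == 'AND':
            mask = mask & current
        elif logic == 'OR':
            mask = mask | current
        elif logic == 'XOR':
            mask = mask ^ current
        else:
            raise ValueError(f"unsupported logic: {logic}")
    return mask

df = pd.DataFrame({'gender': ['M', 'F', 'M', 'F'], 'age': [70, 80, 30, 40]})
selection = [
    {'column': 'gender', 'condition': "=='M'"},
    {'column': 'age', 'condition': ">65", 'logic': 'XOR'},
]
mask = selection_mask(df, selection)

# the action value is used where the selection holds, otherwise the default
result = ['target' if hit else 'other' for hit in mask]
```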
- correlate_values(canonical: ~typing.Any, header: str, choice: [<class 'int'>, <class 'float'>, <class 'str'>] = None, choice_header: str = None, precision: int = None, jitter: [<class 'int'>, <class 'float'>, <class 'str'>] = None, offset: [<class 'int'>, <class 'float'>, <class 'str'>] = None, transform: ~typing.Any = None, lower: [<class 'int'>, <class 'float'>] = None, upper: [<class 'int'>, <class 'float'>] = None, keep_zero: bool = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) list
correlate a list of continuous values adjusting those values, or a subset of those values, with a normalised jitter (std from the value) along with a value offset.
choice, jitter and offset can accept environment variable string names starting with ${ and ending with }.
- Parameters:
canonical – a pd.DataFrame as the reference dataframe
header – the header in the DataFrame to correlate
choice – (optional) The number of values to choose to apply the change to. Can be an environment variable.
choice_header – (optional) those not chosen are given the values of the given header
precision – (optional) to what precision the return values should be
offset – (optional) a fixed value to offset or if str an operation to perform using @ as the header value.
transform – (optional) passing a lambda function to transform the value. e.g.
lambda x: (x - 3) / 2
jitter – (optional) a perturbation of the value where the jitter is a random normally distributed std
precision – (optional) how many decimal places. default to 3
seed – (optional) the random seed. defaults to current datetime
keep_zero – (optional) if True then zeros passed remain zero despite a change, Default is False
lower – a minimum value not to go below
upper – a max value not to go above
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
an equal length list of correlated values
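The combination of jitter, bounds and precision can be pictured with a short NumPy sketch. The helper jitter_values is hypothetical; it assumes jitter is the standard deviation of zero-mean normal noise added to each value, that lower and upper clip the result, and that precision is applied last. The library's actual ordering of these steps is not documented here.

```python
import numpy as np

def jitter_values(values, jitter, lower=None, upper=None, precision=3, seed=None):
    """Hypothetical helper: add normal noise, clip to bounds, then round."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # zero-mean normal noise with the jitter as its standard deviation
    noisy = values + rng.normal(loc=0.0, scale=jitter, size=values.shape)
    if lower is not None or upper is not None:
        noisy = np.clip(noisy, lower, upper)
    return np.round(noisy, precision).tolist()

out = jitter_values([10, 20, 30], jitter=2, lower=9, upper=31, seed=0)
```

Passing the same seed reproduces the same list, which is the usual reason for exposing a seed parameter.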
- frame_selection(canonical: ~typing.Any, selection: list = None, choice: int = None, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame
Selects rows and/or columns, changing the shape of the DataFrame. This is always run last in a pipeline. Rows are filtered before the column filter so columns can be referenced even though they might not be included in the final column list.
- Parameters:
canonical – a direct or generated pd.DataFrame. see context notes below
selection – a list of selections where conditions are filtered on, executed in list order
An example of a selection with the minimum requirements is: (see ‘select2dict(…)’)
[{'column': 'genre', 'condition': "=='Comedy'"}]
choice – a number of rows to select, randomly selected from the index
headers – a list of headers to drop or filter on type
drop – to drop or not drop the headers
dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’
exclude – to exclude or include the dtypes
regex – a regular expression to search the headers. example ‘^((?!_amt).)*$’ excludes ‘_amt’ columns
re_ignore_case – true if the regex should ignore case. Default is False
seed – this is a place holder, here for compatibility across methods
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
pd.DataFrame
Selections are a list of dictionaries of conditions and optional additional parameters to filter. To help build conditions there is a static helper method called ‘select2dict(…)’ that has parameter options available to build a condition. An example of a condition with the minimum requirements is [{'column': 'genre', 'condition': "=='Comedy'"}]
An example of using the helper method:
selection = [inst.select2dict(column='gender', condition="=='M'"), inst.select2dict(column='age', condition=">65", logic='XOR')]
Using the ‘select2dict’ method ensures the correct keys are used and the dictionary is properly formed. It also helps with building the logic that is executed in order.
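The row-then-column ordering of frame_selection can be mimicked in plain pandas. This sketch filters rows on a column that the header regex then removes, using the negative-lookahead pattern from the parameter list written with balanced parentheses; the column names are made up for illustration and the library itself is not called.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'age': [25, 70, 40],
    'sales_amt': [1.5, 2.5, 3.5],
})

# rows are filtered first, so 'sales_amt' can still be referenced here
rows = df[df['sales_amt'] > 2.0]

# then the header regex keeps only columns that do not contain '_amt'
result = rows.filter(regex=r'^((?!_amt).)*$')

# a dtype filter in the same spirit, keeping numeric columns only
numeric_only = result.select_dtypes(include='number')
```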
- frame_starter(canonical: ~typing.Any, selection: list = None, choice: int = None, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, rename_map: dict = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame
Selects rows and/or columns, changing the shape of the DataFrame. This is always run first in a pipeline. Rows are filtered before columns are filtered so columns can be referenced even though they might not be included in the final column list.
- Parameters:
canonical – a direct or generated pd.DataFrame. see context notes below
choice – a number of rows to select, randomly selected from the index
selection – a list of selections where conditions are filtered on, executed in list order
An example of a selection with the minimum requirements is: (see ‘select2dict(…)’)
[{'column': 'genre', 'condition': "=='Comedy'"}]
headers – a list of headers to drop or filter on type
drop – to drop or not drop the headers
dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’
exclude – to exclude or include the dtypes
regex – a regular expression to search the headers. example ‘^((?!_amt).)*$’ excludes ‘_amt’ columns
re_ignore_case – true if the regex should ignore case. Default is False
seed – this is a place holder, here for compatibility across methods
rename_map – a from: to dictionary of headers to rename
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
pd.DataFrame
The canonical is a pd.DataFrame, a pd.Series or list, a connector contract str reference or a set of parameter instructions on how to generate a pd.DataFrame. The description of each is:
pd.DataFrame -> a deep copy of the pd.DataFrame
pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given
str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection
int -> generates an empty pd.DataFrame with an index size of the int passed.
dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame
- - model_*(...) -> one of the SyntheticBuilder model methods and parameters
- - @empty -> generates an empty pd.DataFrame where size and headers can be passed
:size sets the index size of the dataframe :headers any initial headers for the dataframe
- - @generate -> generate a synthetic file from a remote Domain Contract
:task_name the name of the SyntheticBuilder task to run :repo_uri the location of the Domain Product :size (optional) a size to generate :seed (optional) if a seed should be applied :run_book (optional) if specific intent should be run only
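The simpler canonical forms listed above (DataFrame, Series or list, and int) can be pictured with a short pandas sketch. The helper as_frame is hypothetical and covers only those three cases; connector strings and dict instructions depend on the library and are deliberately left out.

```python
import pandas as pd

def as_frame(canonical, header='default'):
    """Hypothetical helper: coerce a few canonical forms to a DataFrame."""
    if isinstance(canonical, pd.DataFrame):
        # a deep copy, so the caller's frame is never modified
        return canonical.copy(deep=True)
    if isinstance(canonical, (pd.Series, list)):
        # a single column named by 'header'
        return pd.DataFrame({header: list(canonical)})
    if isinstance(canonical, int):
        # an empty frame with an index of the given size
        return pd.DataFrame(index=range(canonical))
    raise TypeError(f"unsupported canonical type: {type(canonical).__name__}")

original = pd.DataFrame({'a': [1, 2]})
copied = as_frame(original)
copied.loc[0, 'a'] = 99  # does not touch 'original'

from_list = as_frame([1, 2, 3])
empty = as_frame(5)
```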
Selections are a list of dictionaries of conditions and optional additional parameters to filter. To help build conditions there is a static helper method called ‘select2dict(…)’ that has parameter options available to build a condition. An example of a condition with the minimum requirements is [{'column': 'genre', 'condition': "=='Comedy'"}]
An example of using the helper method:
selection = [inst.select2dict(column='gender', condition="=='M'"), inst.select2dict(column='age', condition=">65", logic='XOR')]
Using the ‘select2dict’ method ensures the correct keys are used and the dictionary is properly formed. It also helps with building the logic that is executed in order.
- model_concat(canonical: ~typing.Any, other: ~typing.Any, as_rows: bool = None, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, shuffle: bool = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame
returns the full column values directly from another connector data source.
- Parameters:
canonical – a direct or generated pd.DataFrame. see context notes below
other – a direct or generated pd.DataFrame to concatenate
as_rows – (optional) how to concatenate, True adds the connector dataset as rows, False as columns
headers – (optional) a filter of headers from the ‘other’ dataset
drop – (optional) to drop or not drop the headers if specified
dtype – (optional) a filter on data type for the ‘other’ dataset. int, float, bool, object
exclude – (optional) to exclude or include the data types if specified
regex – (optional) a regular expression to search the headers. example ‘^((?!_amt).)*$’ excludes ‘_amt’
re_ignore_case – (optional) true if the regex should ignore case. Default is False
shuffle – (optional) if the rows in the loaded canonical should be shuffled
seed – this is a place holder, here for compatibility across methods
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pd.DataFrame
The other is a pd.DataFrame, a pd.Series or list, a connector contract str reference or a set of parameter instructions on how to generate a pd.DataFrame. The description of each is:
pd.DataFrame -> a deep copy of the pd.DataFrame
pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given
str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection
- dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame
- methods:
model_*(…) -> one of the SyntheticBuilder model methods and parameters
- @empty -> generates an empty pd.DataFrame where size and headers can be passed
:size sets the index size of the dataframe :headers any initial headers for the dataframe
- @generate -> generate a synthetic file from a remote Domain Contract
:task_name the name of the SyntheticBuilder task to run :repo_uri the location of the Domain Product :size (optional) a size to generate :seed (optional) if a seed should be applied :run_book (optional) if specific intent should be run only
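The effect of the as_rows flag in model_concat corresponds to the two axes of pandas.concat. The sketch below shows both directions with plain pandas; the ‘_other’ suffix is an assumption added here to keep the column names distinct and is not something the library is known to do.

```python
import pandas as pd

canonical = pd.DataFrame({'id': [1, 2], 'value': [10, 20]})
other = pd.DataFrame({'id': [3, 4], 'value': [30, 40]})

# as_rows=True: the other dataset is appended as additional rows
as_rows = pd.concat([canonical, other], axis=0, ignore_index=True)

# as_rows=False: the other dataset is added as additional columns
as_columns = pd.concat([canonical, other.add_suffix('_other')], axis=1)
```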
- model_custom(canonical: ~typing.Any, code_str: str, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs)
Commonly used for custom methods, takes a code string that, when executed, changes the canonical and returns the modified canonical. If the code returns a pd.DataFrame, this is returned; otherwise the assumption is that the canonical has been changed in place and thus the modified canonical is returned. When referencing the canonical in the code_str, reference it either by the parameter label ‘canonical’ or by the shortcut ‘@’ symbol. kwargs can also be passed into the code string but must be preceded by a ‘$’ symbol.
- Parameters:
canonical – a direct or generated pd.DataFrame. see context notes below
code_str – an action on those column values
kwargs – a set of kwargs to include in any executable function
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a list or pandas.DataFrame
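The ‘@’ and ‘$’ conventions for model_custom can be illustrated with a small interpreter. The helper run_code_str is hypothetical and shows one plausible reading of the description: ‘@’ is replaced with ‘canonical’, ‘$name’ with a kwarg, and a returned DataFrame takes precedence over the in-place result. It uses eval and exec, so it should only ever be given trusted code strings.

```python
import pandas as pd

def run_code_str(canonical, code_str, **kwargs):
    """Hypothetical helper: run a code string against a canonical."""
    local_vars = {'canonical': canonical}
    for name, value in kwargs.items():
        # '$factor' in the code string refers to the kwarg 'factor'
        code_str = code_str.replace(f'${name}', name)
        local_vars[name] = value
    code_str = code_str.replace('@', 'canonical')
    try:
        # an expression may return a new DataFrame
        result = eval(code_str, {}, local_vars)
    except SyntaxError:
        # a statement is assumed to modify the canonical in place
        exec(code_str, {}, local_vars)
        result = None
    return result if isinstance(result, pd.DataFrame) else canonical

df = pd.DataFrame({'a': [1, 2, 3]})
in_place = run_code_str(df.copy(), "@['b'] = @['a'] * $factor", factor=10)
returned = run_code_str(df.copy(), "@[@['a'] > $limit]", limit=1)
```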
- model_dict_column(canonical: ~typing.Any, header: str, convert_str: bool = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame
takes a column that contains dict and expands them into columns. Note, the column must be a flat dictionary. Complex structures will not work.
- Parameters:
canonical – a pd.DataFrame as the reference dataframe
header – the header of the column to be converted
convert_str – (optional) if the header has the dict as a string convert to dict using ast.literal_eval()
seed – (optional) this is a place holder, here for compatibility across methods
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pd.DataFrame
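Expanding a flat dict column, including the convert_str case, can be sketched with pandas and ast.literal_eval. The column names are made up, and whether the library drops the source column after expansion is an assumption made in this example.

```python
import ast
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'meta': ["{'city': 'Paris', 'zip': 75001}", "{'city': 'Rome', 'zip': 118}"],
})

# convert_str=True case: parse the string representation into real dicts
parsed = df['meta'].map(ast.literal_eval)

# each flat dict becomes one row of new columns, aligned on the original index
expanded = pd.DataFrame(parsed.tolist(), index=df.index)

# assumption: the original dict column is replaced by the expanded columns
result = pd.concat([df.drop(columns='meta'), expanded], axis=1)
```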
- model_difference(canonical: ~typing.Any, other: ~typing.Any, on_key: [<class 'str'>, <class 'list'>], drop_zero_sum: bool = None, summary_connector: bool = None, flagged_connector: str = None, detail_connector: str = None, unmatched_connector: str = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
returns the difference between two canonicals, joined on a common and unique key. The on_key parameter can be a direct reference to the canonical column header or to an environment variable. If the environment variable is used, on_key should be set to "${<<YOUR_ENVIRON>>}" where <<YOUR_ENVIRON>> is the environment variable name.
If the flagged_connector parameter is used, a report flagging mismatched left data with right data is produced for this connector, where 1 indicates a difference and 0 indicates they are the same. By default this method returns this report, but if this parameter is set the original canonical is returned. This allows a canonical pipeline to continue through the component while outputting the difference report.
If the detail_connector parameter is used, a detail report of the differences is produced, showing the left and right values that differ.
If the unmatched_connector parameter is used, the on_key values that don’t match between left and right are reported.
- Parameters:
canonical – a direct or generated pd.DataFrame. see context notes below
other – a direct or generated pd.DataFrame to compare against the canonical
on_key – The name of the key that uniquely joins the canonical to others
drop_zero_sum – (optional) drops rows and columns which has a total sum of zero differences
summary_connector – (optional) a connector name where the summary report is sent
flagged_connector – (optional) a connector name where the differences are flagged
detail_connector – (optional) a connector name where the differences are shown
unmatched_connector – (optional) a connector name where the unmatched keys are shown
seed – (optional) this is a placeholder, here for compatibility across methods
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pd.DataFrame
The other is a pd.DataFrame, a pd.Series or list, a connector contract str reference or a set of parameter instructions on how to generate a pd.DataFrame. The description of each is:
pd.DataFrame -> a deep copy of the pd.DataFrame
pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given
str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection
- dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame
- methods:
model_*(…) -> one of the SyntheticBuilder model methods and parameters
- @empty -> generates an empty pd.DataFrame where size and headers can be passed
:size sets the index size of the dataframe :headers any initial headers for the dataframe
- @generate -> generate a synthetic file from a remote Domain Contract
:task_name the name of the SyntheticBuilder task to run :repo_uri the location of the Domain Product :size (optional) a size to generate :seed (optional) if a seed should be applied :run_book (optional) if specific intent should be run only
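The flagged and unmatched reports described for model_difference can be approximated in plain pandas. This sketch assumes both frames share the same non-key columns and a unique key; the variable names and the drop_zero_sum step are written from the description above and are not the library's code.

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3], 'a': [10, 20, 30], 'b': ['x', 'y', 'z']})
right = pd.DataFrame({'key': [1, 2, 4], 'a': [10, 25, 40], 'b': ['x', 'y', 'w']})

l = left.set_index('key')
r = right.set_index('key')

# keys present on both sides are compared cell by cell
common = l.index.intersection(r.index)
flagged = (l.loc[common] != r.loc[common]).astype(int)  # 1 = differs, 0 = same

# keys that appear on only one side
unmatched = l.index.symmetric_difference(r.index)

# drop_zero_sum: keep only rows and columns that contain a difference
reduced = flagged.loc[flagged.sum(axis=1) > 0, flagged.sum(axis=0) > 0]
```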
- model_drop_outliers(canonical: ~typing.Any, header: str, measure: [<class 'int'>, <class 'float'>] = None, method: str = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
Drops rows in the canonical where the values are deemed outliers based on the method and measure. The selectable methods are interquartile or empirical, of which interquartile is the default.
The ‘empirical’ rule states that for a normal distribution, nearly all of the data will fall within three standard deviations of the mean. Given mu and sigma, a simple way to identify outliers is to compute a z-score for every value, which is defined as the number of standard deviations a value is away from the mean. Therefore the measure given should be the z-score, that is, the number of standard deviations a value is away from the mean. The 68–95–99.7 rule guides the percentage of values that lie within a band around the mean in a normal distribution with a width of two, four and six standard deviations respectively, and thus the choice of z-score.
The ‘interquartile’ range (IQR), also called the midspread, middle 50%, or H‑spread, is a measure of statistical dispersion, being equal to the difference between the 75th and 25th percentiles, or between the upper and lower quartiles of a sample set. The IQR can be used to identify outliers by defining limits on the sample values that are a factor k of the IQR below the 25th percentile or above the 75th percentile. The common value for the factor k is 1.5. A factor k of 3 or more can be used to identify values that are extreme outliers.
- Parameters:
canonical – a pd.DataFrame as the reference dataframe
header – the header in the DataFrame to correlate
method – (optional) A method to run to identify outliers. interquartile (default) or empirical
measure – (optional) A measure against each method, respectively factor k, z-score, quartile (see above)
seed – (optional) the random seed
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
an equal length list of correlated values
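Both outlier rules described for model_drop_outliers can be expressed in a few lines of pandas. The sketch uses a factor k of 1.5 for the interquartile method and a z-score measure of 2 for the empirical method on a small made-up sample; the sample standard deviation used here is an assumption, and the library's exact computation may differ.

```python
import pandas as pd

df = pd.DataFrame({'value': [10, 11, 12, 13, 12, 11, 10, 95]})

# interquartile: keep values within k * IQR of the 25th and 75th percentiles
k = 1.5
q1, q3 = df['value'].quantile([0.25, 0.75])
iqr = q3 - q1
iqr_kept = df[df['value'].between(q1 - k * iqr, q3 + k * iqr)]

# empirical: keep values whose absolute z-score is within the measure
measure = 2
z_score = (df['value'] - df['value'].mean()) / df['value'].std()
z_kept = df[z_score.abs() <= measure]
```

On a sample this small a single extreme value inflates the standard deviation, which is why a measure of 2 rather than 3 is needed to drop it in the empirical case.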
- model_encode_count(canonical: ~typing.Any, headers: [<class 'str'>, <class 'list'>], prefix=None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame
encodes categorical data types. In count encoding we replace the categories by the count of the observations that show that category in the dataset. This technique captures the representation of each label in a dataset, but the encoding may not necessarily be predictive of the outcome.
- Parameters:
canonical – a pd.DataFrame as the reference dataframe
headers – the header(s) to apply the encoding
prefix – a str to prefix the column
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pd.DataFrame
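The count encoding described for model_encode_count can be illustrated with plain pandas. This is a minimal sketch of the technique only, not the library's implementation; the helper name count_encode and the sample data are hypothetical.

```python
import pandas as pd


def count_encode(df: pd.DataFrame, headers) -> pd.DataFrame:
    """Replace each category with the number of rows in which it appears.

    Hypothetical helper for illustration; not part of ds_discovery.
    """
    headers = [headers] if isinstance(headers, str) else list(headers)
    out = df.copy()
    for header in headers:
        # value_counts() gives the number of observations per category
        counts = out[header].value_counts()
        out[header] = out[header].map(counts)
    return out


df = pd.DataFrame({"colour": ["red", "blue", "red", "green", "red"]})
encoded = count_encode(df, "colour")
print(encoded["colour"].tolist())
```

Categories that occur equally often, such as "blue" and "green" here, receive the same code, which is one reason the encoding may not be predictive on its own.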
- model_encode_integer(canonical: ~typing.Any, headers: [<class 'str'>, <class 'list'>], ranking: list = None, prefix=None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
Integer encoding replaces the categories by digits from 1 to n, where n is the number of distinct categories of the variable. Integer encoding can be either nominal or ordinal.
Nominal data is categorical variables without any particular order between categories. This means that the categories cannot be sorted and there is no natural order between them.
Ordinal data represents categories with a natural, ordered relationship between each category. This means that the categories can be sorted in either ascending or descending order. In order to encode integers as ordinal, a ranking must be provided.
If ranking is given, the return will be ordinal values based on the ranking order of the list. If a categorical value is not found in the list it is grouped with other missing values and given the last ranking.
- Parameters:
canonical – a pd.DataFrame as the reference dataframe
headers – the header(s) to apply the encoding
ranking – (optional) if used, ranks the categorical values to the list given
prefix – a str to prefix the column
seed – (optional) a seed value for the random function. Defaults to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pd.DataFrame
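The difference between nominal and ordinal integer encoding, as described for model_encode_integer, can be sketched in plain pandas. The helper integer_encode is hypothetical, and two details are assumptions rather than documented behaviour: nominal codes are assigned in order of first appearance, and values missing from the ranking all share the rank that follows the last listed one.

```python
import pandas as pd


def integer_encode(series: pd.Series, ranking: list = None) -> pd.Series:
    """Hypothetical illustration of nominal and ordinal integer encoding."""
    if ranking is None:
        # Nominal: number the distinct categories 1..n (assumed: first-seen order)
        codes, _ = pd.factorize(series)
        return pd.Series(codes + 1, index=series.index)
    # Ordinal: the position in the ranking list decides the code
    mapping = {value: rank for rank, value in enumerate(ranking, start=1)}
    # Values not found in the ranking are grouped and given the last rank
    return series.map(mapping).fillna(len(ranking) + 1).astype(int)


sizes = pd.Series(["small", "large", "medium", "small", "huge"])
nominal = integer_encode(sizes)
ordinal = integer_encode(sizes, ranking=["small", "medium", "large"])
print(nominal.tolist())
print(ordinal.tolist())
```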
- model_encode_one_hot(canonical: ~typing.Any, headers: [<class 'str'>, <class 'list'>], prefix=None, dtype: ~typing.Any = None, prefix_sep: str = None, dummy_na: bool = False, drop_first: bool = False, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame
Encodes categorical data types using one-hot encoding, which consists of encoding each categorical variable with different boolean variables (also called dummy variables) that take the value 0 or 1 to indicate whether a category is present in an observation.
- Parameters:
canonical – a pd.DataFrame as the reference dataframe
headers – the header(s) to apply the one-hot encoding to
prefix – (optional) a str, list of str, or dict of str used to prefix the new column names. A list should have a length equal to the number of headers being encoded.
prefix_sep – str separator, default ‘_’
dummy_na – adds a column to indicate null values. If False, nulls are ignored.
drop_first – Whether to get k-1 dummies out of k categorical levels by removing the first level.
dtype – Data type for new columns. Only a single dtype is allowed.
seed – (optional) a seed value for the random function. Defaults to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pd.DataFrame
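The parameters of model_encode_one_hot closely resemble those of pandas.get_dummies, so the effect can be sketched with pandas directly. Whether the method delegates to get_dummies is an assumption; the sample data is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "blue", "red"], "size": [1, 2, 3]})

# One dummy column per category, each holding 0 or 1
one_hot = pd.get_dummies(
    df, columns=["colour"], prefix="c", prefix_sep="_",
    dummy_na=False, drop_first=False, dtype=int,
)

# drop_first=True keeps k-1 dummies by removing the first category level
reduced = pd.get_dummies(df, columns=["colour"], prefix="c", drop_first=True, dtype=int)

print(list(one_hot.columns))
print(list(reduced.columns))
```

Dropping the first level avoids perfectly collinear dummy columns, which matters for linear models.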
- model_explode(canonical: ~typing.Any, header: str, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame
Takes a single column of list values and explodes the DataFrame so that each element of a row's list is represented by its own row.
- Parameters:
canonical – a direct or generated pd.DataFrame. see context notes below
header – the header of the column to be exploded
seed – (optional) this is a placeholder, here for compatibility across methods
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pd.DataFrame
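The row expansion performed by model_explode corresponds to what pandas calls exploding a column. A minimal pandas sketch, assuming similar behaviour and using hypothetical sample data:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "tags": [["a", "b"], ["c"]]})

# Each list element becomes its own row; the other columns are repeated
exploded = df.explode("tags", ignore_index=True)

print(exploded["id"].tolist())
print(exploded["tags"].tolist())
```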
- model_group(canonical: ~typing.Any, group_by: [<class 'str'>, <class 'list'>], headers: [<class 'str'>, <class 'list'>] = None, regex: bool = None, aggregator: str = None, list_choice: int = None, list_max: int = None, drop_group_by: bool = False, seed: int = None, include_weighting: bool = False, freq_precision: int = None, remove_weighting_zeros: bool = False, remove_aggregated: bool = False, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame
Groups the canonical by the group_by header(s) and aggregates the given headers. In addition to the standard groupby aggregators there are also ‘list’ and ‘set’, which return an aggregated list or set. These can be used in conjunction with ‘list_choice’ and ‘list_max’ to control the return values. If list_max is set to 1 then a single value is returned rather than a list of size 1.
- Parameters:
canonical – a direct or generated pd.DataFrame. see context notes below
headers – the column headers to apply the aggregation to
group_by – the column headers to group by
regex – if the column headers is a regex
aggregator – (optional) the aggregator as a function of Pandas DataFrame ‘groupby’ or ‘list’ or ‘set’
list_choice – (optional) used in conjunction with list or set aggregator to return a random n choice
list_max – (optional) used in conjunction with list or set aggregator restricts the list to a n size
drop_group_by – (optional) drops the group by headers
include_weighting – (optional) include a percentage weighting column for each
freq_precision – (optional) a precision for the relative_freq values
remove_aggregated – (optional) if used in conjunction with the weighting then drops the aggregator column
remove_weighting_zeros – (optional) removes zero values
seed – (optional) this is a placeholder, here for compatibility across methods
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pd.DataFrame
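The ‘list’ aggregator and the effect of list_max in model_group can be sketched with a plain pandas groupby. The helper group_to_list is hypothetical and only mirrors the behaviour described above: truncation to list_max items and a single value when list_max is 1. It does not cover list_choice or the weighting options.

```python
import pandas as pd


def group_to_list(df: pd.DataFrame, group_by: str, header: str, list_max: int = None) -> pd.DataFrame:
    """Hypothetical illustration of a 'list' aggregation with an optional size limit."""
    rows = []
    for key, group in df.groupby(group_by):
        items = list(group[header])
        if list_max is not None:
            items = items[:list_max]
        # With list_max == 1 a single value is returned rather than a list of size 1
        value = items[0] if list_max == 1 else items
        rows.append({group_by: key, header: value})
    return pd.DataFrame(rows)


df = pd.DataFrame({"team": ["a", "a", "b"], "player": ["x", "y", "z"]})
as_lists = group_to_list(df, "team", "player")
as_single = group_to_list(df, "team", "player", list_max=1)
print(as_lists["player"].tolist())
print(as_single["player"].tolist())
```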
- model_merge(canonical: ~typing.Any, other: ~typing.Any, left_on: str = None, right_on: str = None, on: str = None, how: str = None, headers: list = None, suffixes: tuple = None, indicator: bool = None, validate: str = None, replace_nulls: bool = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame
returns the full column values directly from another connector data source.
- Parameters:
canonical – a direct or generated pd.DataFrame. see context notes below
other – a direct or generated pd.DataFrame. see context notes below
left_on – the canonical key column(s) to join on
right_on – the merging dataset key column(s) to join on
on – if the left and right join have the same header name this can replace left_on and right_on
how – (optional) One of ‘left’, ‘right’, ‘outer’, ‘inner’. Defaults to inner. See below for more detailed description of each method.
headers – (optional) a filter on the headers included from the right side
suffixes – (optional) A tuple of string suffixes to apply to overlapping columns. Defaults to (‘’, ‘_dup’).
indicator – (optional) Add a column to the output DataFrame called _merge with information on the source of each row. _merge is Categorical-type and takes on a value of left_only for observations whose merge key only appears in ‘left’ DataFrame or Series, right_only for observations whose merge key only appears in ‘right’ DataFrame or Series, and both if the observation’s merge key is found in both.
validate – (optional) if specified, checks if the merge is of the specified type. Defaults to None. “one_to_one” or “1:1”: checks if merge keys are unique in both left and right datasets. “one_to_many” or “1:m”: checks if merge keys are unique in left dataset. “many_to_one” or “m:1”: checks if merge keys are unique in right dataset. “many_to_many” or “m:m”: allowed, but does not result in checks.
replace_nulls – (optional) replaces nulls with an appropriate value dependent upon the field type
seed – this is a placeholder, here for compatibility across methods
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pd.DataFrame
The other is a pd.DataFrame, a pd.Series or list, a connector contract str reference or a set of parameter instructions on how to generate a pd.DataFrame. The description of each is:
pd.DataFrame -> a deep copy of the pd.DataFrame
pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given
str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection
- dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame
- methods:
model_*(…) -> one of the SyntheticBuilder model methods and parameters
- @empty -> generates an empty pd.DataFrame where size and headers can be passed
:size sets the index size of the dataframe :headers any initial headers for the dataframe
- @generate -> generate a synthetic file from a remote Domain Contract
:task_name the name of the SyntheticBuilder task to run :repo_uri the location of the Domain Product :size (optional) a size to generate :seed (optional) if a seed should be applied :run_book (optional) if specific intent should be run only
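The how, suffixes, indicator and validate parameters of model_merge carry the same names and meanings as in pandas DataFrame.merge, so their effect can be sketched in pandas. That the method passes them through unchanged is an assumption; the data is hypothetical.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})
right = pd.DataFrame({"id": [2, 3, 4], "value": [200, 300, 400]})

merged = left.merge(
    right,
    on="id",                  # same key name on both sides
    how="left",               # keep every row of the left (canonical) frame
    suffixes=("", "_dup"),    # overlapping right-hand columns get '_dup'
    indicator=True,           # adds the _merge column
    validate="one_to_one",    # raises if keys are not unique on both sides
)

print(list(merged.columns))
print(merged["_merge"].tolist())
```

The row with id 1 has no match on the right, so its value_dup is null and its _merge entry is left_only.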
- model_missing_cca(canonical: ~typing.Any, threshold: float = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame
Applies Complete Case Analysis to the canonical. Complete-case analysis (CCA), also called “list-wise deletion” of cases, consists of discarding observations with any missing values. In other words, we only keep observations with data on all the variables. CCA works well when the data is missing completely at random.
- Parameters:
canonical – a pd.DataFrame as the reference dataframe
threshold – (optional) a null threshold between 0 and 1 where 1 is all nulls. Defaults to 0.5
seed – (optional) a placeholder
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pd.DataFrame
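Complete-case analysis as described for model_missing_cca amounts to list-wise deletion. A minimal pandas sketch of that idea follows; it deliberately leaves out the threshold parameter, whose exact effect is not described here, and the data is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", "y", None]})

# Keep only the observations that have data on all the variables
complete = df.dropna(axis=0, how="any").reset_index(drop=True)

print(len(df), len(complete))
```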
- model_modifier(canonical: ~typing.Any, other: ~typing.Any, targets_header: str = None, values_header: str = None, modifier: str = None, seed: int = None, precision: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame
Modifies a given set of target header names within the canonical using the target value for each name. The modifier indicates the type of modification to be performed. It is assumed that the other DataFrame has the target headers as the first column and the target values as the second column. If this is not the case, the targets_header and values_header parameters can be used to specify the other header names.
- Parameters:
canonical – a pd.DataFrame as the reference dataframe
other – a direct or generated pd.DataFrame. see context notes below
targets_header – (optional) the name of the target header where the header names are listed
values_header – (optional) The name of the value header where the target values are listed
modifier – (optional) how the value is to be modified. Options are ‘add’, ‘sub’, ‘mul’, ‘div’
precision – (optional) the value precision of the return values
seed – (optional) this is a placeholder, here for compatibility across methods
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pd.DataFrame
The other is a pd.DataFrame, a pd.Series or list, a connector contract str reference or a set of parameter instructions on how to generate a pd.DataFrame. The description of each is:
pd.DataFrame -> a deep copy of the pd.DataFrame
pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given
str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection
- dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame
- methods:
model_*(…) -> one of the SyntheticBuilder model methods and parameters
- @empty -> generates an empty pd.DataFrame where size and headers can be passed
:size sets the index size of the dataframe :headers any initial headers for the dataframe
- @generate -> generate a synthetic file from a remote Domain Contract
:task_name the name of the SyntheticBuilder task to run :repo_uri the location of the Domain Product :size (optional) a size to generate :seed (optional) if a seed should be applied :run_book (optional) if specific intent should be run only
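The layout that model_modifier expects of the other DataFrame, with target headers in the first column and target values in the second, can be sketched as follows. The helper modify, the default of ‘add’ and the sample data are hypothetical and are not taken from the library.

```python
import operator

import pandas as pd

MODIFIERS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "div": operator.truediv,
}


def modify(canonical: pd.DataFrame, other: pd.DataFrame, modifier: str = "add",
           targets_header: str = None, values_header: str = None) -> pd.DataFrame:
    """Hypothetical illustration: apply one value per named column."""
    # By default the first column lists the targets and the second their values
    targets_header = targets_header or other.columns[0]
    values_header = values_header or other.columns[1]
    func = MODIFIERS[modifier]
    out = canonical.copy()
    for target, value in zip(other[targets_header], other[values_header]):
        if target in out.columns:
            out[target] = func(out[target], value)
    return out


canonical = pd.DataFrame({"price": [10.0, 20.0], "qty": [1, 2]})
other = pd.DataFrame({"target": ["price", "qty"], "value": [2, 3]})
result = modify(canonical, other, modifier="mul")
print(result)
```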
- model_profiling(canonical: ~typing.Any, profiling: str, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, connector_name: str = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs)
Data profiling is the process of examining, analyzing, and creating useful summaries of data. The process yields a high-level overview which aids in the discovery of data quality issues, risks, and overall trends. It can be used to identify any errors, anomalies, or patterns that may exist within the data. There are three types of data profiling available: ‘dictionary’, ‘schema’ or ‘quality’.
If connector_name is used, it outputs the results to this connector contract and returns the original canonical. This allows a canonical pipeline to continue through the component while outputting the data profile to an alternative path.
- Parameters:
canonical – a direct or generated pd.DataFrame. see context notes below
profiling – The profiling name. Options are ‘dictionary’, ‘schema’ or ‘quality’
headers – (optional) a filter of headers from the ‘other’ dataset
drop – (optional) to drop or not drop the headers if specified
dtype – (optional) a filter on data type for the ‘other’ dataset. int, float, bool, object
exclude – (optional) to exclude or include the data types if specified
regex – (optional) a regular expression to search the headers. example ‘^((?!_amt).)*$’ excludes ‘_amt’
re_ignore_case – (optional) true if the regex should ignore case. Default is False
connector_name – (optional) a connector name where the outcome is sent
seed – (optional) this is a placeholder, here for compatibility across methods
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
kwargs – if using connector_name, any kwargs to pass to the handler
- Returns:
a pd.DataFrame
The other is a pd.DataFrame, a pd.Series or list, a connector contract str reference or a set of parameter instructions on how to generate a pd.DataFrame. The description of each is:
pd.DataFrame -> a deep copy of the pd.DataFrame
pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given
str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection
- dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame
- methods:
model_*(…) -> one of the SyntheticBuilder model methods and parameters
- @empty -> generates an empty pd.DataFrame where size and headers can be passed
:size sets the index size of the dataframe :headers any initial headers for the dataframe
- @generate -> generate a synthetic file from a remote Domain Contract
:task_name the name of the SyntheticBuilder task to run :repo_uri the location of the Domain Product :size (optional) a size to generate :seed (optional) if a seed should be applied :run_book (optional) if specific intent should be run only
- model_sample(canonical: ~typing.Any, other: ~typing.Any, headers: list, replace: bool = None, relative_freq: list = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame
Takes a target dataset and samples from that target to the size of the canonical
- Parameters:
canonical – a pd.DataFrame as the reference dataframe
other – a direct or generated pd.DataFrame. see context notes below
headers – the headers to be selected from the other DataFrame
replace – assuming other is bigger than canonical, selects without replacement when True
relative_freq – (optional) a weighting pattern that does not have to add to 1
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pd.DataFrame
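Sampling from another dataset to the size of the canonical, as model_sample describes, can be sketched with pandas DataFrame.sample. The helper sample_to_size is hypothetical. Its replace argument follows the pandas meaning, where True samples with replacement, which may not match the library's own parameter.

```python
import pandas as pd


def sample_to_size(canonical: pd.DataFrame, other: pd.DataFrame, headers: list,
                   replace: bool = True, seed: int = None) -> pd.DataFrame:
    """Hypothetical illustration: draw len(canonical) rows from other[headers]."""
    sampled = other[headers].sample(n=len(canonical), replace=replace, random_state=seed)
    # Align on a fresh index so the sampled columns sit beside the canonical rows
    return pd.concat(
        [canonical.reset_index(drop=True), sampled.reset_index(drop=True)], axis=1
    )


canonical = pd.DataFrame({"id": range(5)})
other = pd.DataFrame({"city": ["London", "Paris", "Rome"], "code": [1, 2, 3]})
result = sample_to_size(canonical, other, headers=["city"], replace=True, seed=0)
print(result.shape)
```

Sampling with replacement is needed here because other has fewer rows than the canonical.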
- model_to_category(canonical: ~typing.Any, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
converts columns to categories
- Parameters:
canonical – a pd.DataFrame as the reference dataframe
headers – a list of headers to drop or filter on type
drop – to drop or not drop the headers
dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’
exclude – to exclude or include the dtypes
regex – a regular expression to search the headers
re_ignore_case – true if the regex should ignore case. Default is False
seed – (optional) a placeholder
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pd.DataFrame
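Converting columns to categories, as model_to_category does, corresponds to the pandas category dtype. A minimal pandas sketch with hypothetical data, with the header selection written out by hand rather than through the headers, dtype or regex filters:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "blue", "red"], "n": [1, 2, 3]})

# Convert only the selected headers; other columns are left untouched
for header in ["colour"]:
    df[header] = df[header].astype("category")

print(df.dtypes)
```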
- model_to_numeric(canonical: ~typing.Any, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, precision: int = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
converts columns to numeric value
- Parameters:
canonical – a pd.DataFrame as the reference dataframe
headers – a list of headers to drop or filter on type
drop – to drop or not drop the headers
dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’
exclude – to exclude or include the dtypes
regex – a regular expression to search the headers
re_ignore_case – true if the regex should ignore case. Default is False
precision – (optional) an int value of the precision for the float
seed – (optional) a placeholder
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pd.DataFrame
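Numeric conversion with a precision, as in model_to_numeric, can be sketched with pandas.to_numeric. Coercing unparseable values to null is an assumption made for this example and may not be how the library treats them; the data is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"amount": ["1.256", "2", "n/a"]})

# errors="coerce" turns values that cannot be parsed into NaN (assumed behaviour)
df["amount"] = pd.to_numeric(df["amount"], errors="coerce").round(2)

print(df["amount"].tolist())
```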