IntentSyntheticBuild Class

class ds_discovery.intent.synthetic_intent.SyntheticIntentModel(property_manager: ~ds_discovery.managers.synthetic_property_manager.SyntheticPropertyManager, default_save_intent: bool = None, default_intent_level: [<class 'str'>, <class 'int'>, <class 'float'>] = None, order_next_available: bool = None, default_replace_intent: bool = None)

Synthetic data is representative data that, depending on its application, holds the statistical and distributive characteristics of its real-world counterpart. This component provides a set of actions that focus on building synthetic data through knowledge and statistical analysis.

static action2dict(method: Any, **kwargs) → dict

a utility method to help build feature conditions by aligning method parameters with dictionary format.

Parameters:

method – the method to execute
kwargs – name value pairs associated with the method

Returns:

dictionary of the parameters

Special method values:

@header: use a column as the value reference, expects the ‘header’ key
@constant: use a value constant, expects the key ‘value’
@sample: use to get sample values, expects the ‘name’ of the Sample method, optional ‘shuffle’ boolean
@eval: evaluate a code string, expects the key ‘code_str’ and any locals() required

static canonical2dict(method: Any, **kwargs) → dict

a utility method to help build feature conditions by aligning method parameters with dictionary format. The method parameter can be either a ‘model_*’ or ‘frame_*’ method, with two special reserved options

Special reserved method values:

@empty: returns an empty dataframe, optionally with the key values size: int and headers: list
@generate: generates a dataframe either from_env(task_name) or from a remote repo uri. The params are:

task_name: the task name of the generator
repo_uri: (optional) a remote repo to retrieve the domain contract
size: (optional) the generated sample size
seed: (optional) if seeding should be applied, the seed value
run_book: (optional) a domain contract runbook to execute as part of the pipeline

Parameters:

method – the method to execute
kwargs – name value pairs associated with the method

Returns:

dictionary of the parameters

correlate_activation(canonical: ~typing.Any, header: str, activation: str = None, precision: int = None, seed: int = None, rtn_type: str = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

Activation functions play a crucial role in the backpropagation algorithm, which is the primary algorithm used for training neural networks. During backpropagation, the error of the output is propagated backwards through the network, and the weights of the network are updated based on this error. The activation function is used to introduce non-linearity into the output of a neural network layer.

Logistic Sigmoid, the inverse of the logit function, maps any input value to a value between 0 and 1, making it useful for binary classification problems, and is defined as f(x) = 1/(1+exp(-x))

Tangent Hyperbolic (tanh) function is a shifted and stretched version of the Sigmoid function that maps the input values to a range between -1 and 1, and is defined as f(x) = (exp(x)-exp(-x))/(exp(x)+exp(-x))

Rectified Linear Unit (ReLU) function is a widely used activation function. It replaces negative values with zero and keeps the positive values unchanged, and is defined as f(x) = x * (x > 0)
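The three definitions above can be written out directly. The following is a minimal sketch using only the standard library, not the component's own implementation; the names ACTIVATIONS and apply_activation are hypothetical and a plain list stands in for the DataFrame column.

```python
import math


def sigmoid(x: float) -> float:
    # f(x) = 1/(1+exp(-x)), output in the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x: float) -> float:
    # f(x) = (exp(x)-exp(-x))/(exp(x)+exp(-x)), output in (-1, 1)
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def relu(x: float) -> float:
    # f(x) = x * (x > 0): negatives become zero, positives are unchanged
    return x * (x > 0)


# hypothetical lookup from the 'activation' option name to its function
ACTIVATIONS = {"sigmoid": sigmoid, "tanh": tanh, "relu": relu}


def apply_activation(values, activation="sigmoid", precision=3):
    # apply the named function to every value and round to 'precision' places
    fn = ACTIVATIONS[activation]
    return [round(fn(v), precision) for v in values]
```

Calling apply_activation([-2.0, 0.0, 2.0], "relu") leaves the positive value untouched and zeroes the negative one, while "sigmoid" and "tanh" squash the same inputs into their respective ranges.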

Parameters:

canonical – a pd.DataFrame as the reference dataframe
header – the header in the DataFrame to correlate
activation – (optional) the name of the activation function. Options ‘sigmoid’, ‘tanh’ and ‘relu’
precision – (optional) how many decimal places. default to 3
seed – (optional) the random seed. defaults to current datetime
rtn_type – (optional) changes the default return of a ‘list’ to a pd.Series other than the int, float, category, string and object, passing ‘as-is’ will return as is
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, added to a level above any current instance of the intent section, or level 0 if not found. If int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

an equal length list of correlated values

correlate_aggregate(canonical: ~typing.Any, headers: list, agg: str, seed: int = None, rtn_type: str = None, save_intent: bool = None, precision: int = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

correlate two or more columns with each other through a finite set of aggregation functions. The aggregation function names are limited to ‘sum’, ‘prod’, ‘count’, ‘min’, ‘max’ and ‘mean’ for numeric columns and a special ‘list’ function name to combine the columns as a list

Parameters:

canonical – a direct or generated pd.DataFrame. see context notes below
headers – a list of headers to correlate
agg – the aggregation function name to enact. The available functions are: ‘sum’, ‘prod’, ‘count’, ‘min’, ‘max’, ‘mean’ and ‘list’ which combines the columns as a list
precision – the value precision of the return values
seed – (optional) a seed value for the random function: default to None
rtn_type – (optional) changes the default return of a ‘list’ to a pd.Series other than the int, float, category, string and object, passing ‘as-is’ will return as is
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, added to a level above any current instance of the intent section, or level 0 if not found. If int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a list of equal length to the one passed
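As an illustration of the row-wise aggregation described for correlate_aggregate, here is a minimal sketch in plain Python. It is not the component's implementation: a dict of lists stands in for the DataFrame, the names AGGREGATIONS and aggregate_rows are hypothetical, and null handling is omitted.

```python
import math
from statistics import mean

# hypothetical mapping of the documented 'agg' names to plain Python callables
AGGREGATIONS = {
    "sum": sum,
    "prod": math.prod,
    "count": len,  # counts every value; nulls are not treated specially here
    "min": min,
    "max": max,
    "mean": mean,
    "list": list,
}


def aggregate_rows(canonical: dict, headers: list, agg: str, precision: int = 3) -> list:
    # combine the named columns row by row using the chosen aggregation
    fn = AGGREGATIONS[agg]
    rows = zip(*(canonical[h] for h in headers))
    result = [fn(list(row)) for row in rows]
    if agg not in ("list", "count"):
        result = [round(v, precision) for v in result]
    return result
```

For a canonical of {"A": [1, 2, 3], "B": [4, 5, 6]}, the ‘sum’ aggregation over both headers gives one value per row, and ‘list’ gives one two-element list per row, so the result is always the same length as the input columns.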

correlate_categories(canonical: ~typing.Any, header: str, correlations: list, actions: dict, default_action: [<class 'str'>, <class 'int'>, <class 'float'>, <class 'dict'>] = None, rtn_type: str = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

correlates a set of values to an action. The correlations must map to the index keys of the actions dictionary. Note: to use the current value from the passed values, either as a parameter value or as the action value itself, pass an empty dict {} as that key's value.

simple correlation list:

['A', 'B', 'C'] # if values is 'A' then action is 0 and so on

multiple choice correlation

[['A','B'], 'C'] # if values is 'A' OR 'B' then action is 0 and so on

actions dictionary where the method is a class method followed by its parameters

{0: {'method': 'get_numbers', 'from_value': 0, 'to_value': 27}}

you can also use the action to specify a specific value:

{0: 'F', 1: {'method': 'get_numbers', 'from_value': 0, 'to_value': 27}}
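The mapping from a correlation list to an action key can be sketched as follows. This is an illustrative stand-in only; action_index is a hypothetical helper and not part of the documented API.

```python
def action_index(value, correlations):
    # return the position of the first correlation entry that matches 'value';
    # an entry may be a single category or a list of alternative categories
    for idx, entry in enumerate(correlations):
        choices = entry if isinstance(entry, list) else [entry]
        if value in choices:
            return idx
    # no match: the caller would fall back to the default action
    return None
```

With correlations of [['A', 'B'], 'C'], both 'A' and 'B' resolve to action key 0 and 'C' resolves to action key 1; an unmatched value resolves to None, which is where default_action would apply.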

Parameters:

canonical – a direct or generated pd.DataFrame. see context notes below
header – the header in the DataFrame to correlate
correlations – a list of categories (can also contain lists for multiple correlations).
actions – the correlated set of categories that should map to the index
default_action – (optional) a default action to take if the selection is not fulfilled
seed – a seed value for the random function: default to None
rtn_type – (optional) changes the default return of a ‘list’ to a pd.Series other than the int, float, category, string and object, passing ‘as-is’ will return as is
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, added to a level above any current instance of the intent section, or level 0 if not found. If int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a list of equal length to the one passed

Actions are the resulting outcome of the selection (or the default). An action can be just a value or a dict that executes an intent method such as get_number(). To help build actions there is a helper function called action2dict(…) that takes a method as a mandatory attribute.

With actions there are special keyword ‘method’ values:

@header: use a column as the value reference, expects the ‘header’ key
@constant: use a value constant, expects the key ‘value’
@sample: use to get sample values, expected ‘name’ of the Sample method, optional ‘shuffle’ boolean
@eval: evaluate a code string, expects the key ‘code_str’ and any locals() required

An example of a simple action to return a selection from a list:

{'method': 'get_category', 'selection': ['M', 'F', 'U']}

an example of using the helper method, in this example we use the keyword @header to get a value from another column at the same index position:

inst.action2dict(method="@header", header='value')

We can even execute some sort of evaluation at run time:

inst.action2dict(method="@eval", code_str='sum(values)', values=[1,4,2,1])
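To make the dictionary form concrete, here is a minimal stand-in, assuming the helper does nothing more than merge the method name with its keyword arguments. The name action2dict_sketch is hypothetical, and the real action2dict may validate or transform its input.

```python
def action2dict_sketch(method, **kwargs) -> dict:
    # assumption: the action dict is the 'method' key plus the passed parameters
    return {"method": method, **kwargs}


# the two helper calls shown above, expressed through the stand-in
header_action = action2dict_sketch(method="@header", header="value")
eval_action = action2dict_sketch(method="@eval", code_str="sum(values)", values=[1, 4, 2, 1])
```

Under that assumption, header_action has the same shape as the hand-written action dictionaries shown earlier, with ‘method’ alongside the parameter keys.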

correlate_custom(canonical: ~typing.Any, code_str: str, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs)

Commonly used for custom list comprehension. Takes a code string that, when evaluated, returns a list of values. Before using this method, consider the method correlate_selection(…)

When referencing the canonical in the code_str, it should be referenced either by the parameter label ‘canonical’ or by the shortcut ‘@’ symbol. For example:

code_str = "[x + 2 for x in @['A']]" # where 'A' is a header in the canonical

kwargs can also be passed into the code string but must be preceded by a ‘$’ symbol, for example:

code_str = "[True if x == $v1 else False for x in @['A']]" # where 'v1' is a kwargs
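One way to picture the ‘@’ and ‘$’ substitutions is the sketch below. It assumes a simple textual replacement followed by eval, which may differ from what the component actually does; eval_code_str is a hypothetical name and a dict of lists stands in for the DataFrame.

```python
def eval_code_str(canonical, code_str: str, **kwargs):
    # replace each '$name' with the literal form of the matching kwarg,
    # longest names first so '$v10' is not clobbered by '$v1'
    for name in sorted(kwargs, key=len, reverse=True):
        code_str = code_str.replace(f"${name}", repr(kwargs[name]))
    # '@' is shorthand for the canonical (naive: assumes no '@' inside string literals)
    code_str = code_str.replace("@", "canonical")
    # evaluate with only the canonical visible to the expression
    return eval(code_str, {"canonical": canonical})


canonical = {"A": [1, 2, 3]}  # stand-in for a DataFrame with header 'A'
plus_two = eval_code_str(canonical, "[x + 2 for x in @['A']]")
flags = eval_code_str(canonical, "[True if x == $v1 else False for x in @['A']]", v1=2)
```

As with any use of eval, the code string should come from a trusted source.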

Parameters:

canonical – a pd.DataFrame as the reference dataframe
code_str – an action on those column values. to reference the canonical use ‘@’
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, added to a level above any current instance of the intent section, or level 0 if not found. If int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
kwargs – a set of kwargs to include in any executable function

Returns:

value set based on the selection list and the action

correlate_date_diff(canonical: ~typing.Any, first_date: str, second_date: str, units: str = None, precision: int = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs)

returns a column for the difference between a primary and secondary date, where the primary is an earlier date than the secondary.

Parameters:

canonical – the DataFrame containing the column headers
first_date – the primary or older date field
second_date – the secondary or newer date field
units – (optional) The Timedelta units e.g. ‘D’, ‘W’, ‘M’, ‘Y’. default is ‘D’
precision – the precision of the result
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, added to a level above any current instance of the intent section, or level 0 if not found. If int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
kwargs – a set of kwargs to include in any executable function

Returns:

value set based on the selection list and the action
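The date difference can be sketched with the standard library as below. This is an illustration under assumptions, not the component's code: date_diff is a hypothetical name, only the ‘D’ and ‘W’ units are covered, and two lists stand in for the first_date and second_date columns.

```python
from datetime import date


def date_diff(first_dates, second_dates, units="D", precision=0):
    # difference (second - first) for each row, expressed in days or weeks
    days_per_unit = {"D": 1, "W": 7}[units]
    return [
        round((second - first).days / days_per_unit, precision)
        for first, second in zip(first_dates, second_dates)
    ]


older = [date(2024, 1, 1), date(2024, 2, 1)]
newer = [date(2024, 1, 15), date(2024, 2, 8)]
in_days = date_diff(older, newer)
in_weeks = date_diff(older, newer, units="W", precision=1)
```

Month and year units are left out of the sketch because they have no fixed length in days.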

correlate_dates(canonical: ~typing.Any, header: str, choice: [<class 'int'>, <class 'float'>, <class 'str'>] = None, choice_header: str = None, offset: [<class 'int'>, <class 'dict'>, <class 'str'>] = None, jitter: [<class 'int'>, <class 'str'>] = None, jitter_units: str = None, ignore_time: bool = None, ignore_seconds: bool = None, min_date: str = None, max_date: str = None, now_delta: str = None, date_format: str = None, day_first: bool = None, year_first: bool = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

correlate a list of continuous dates adjusting those dates, or a subset of those dates, with a normalised jitter along with a value offset. choice, jitter and offset can accept environment variable string names starting with ${ and ending with }.

When using offset and a dict is passed, the unit name determines the behaviour. A plural unit, as in {‘days’: 1}, adds to the current value (here, 1 day). A singular unit, as in {‘hour’: 3}, replaces the current value (here, the hour is set to 3). Offsets can be ‘years’, ‘months’, ‘weeks’, ‘days’, ‘hours’, ‘minutes’ or ‘seconds’. If an int is passed days are assumed.
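The difference between plural and singular offset keys can be illustrated with the standard library. This is a sketch of the add-versus-replace semantics only, not the component's implementation; apply_offset is a hypothetical name, and ‘years’ and ‘months’ are left out because timedelta does not support them.

```python
from datetime import datetime, timedelta


def apply_offset(ts: datetime, offset) -> datetime:
    # an int is treated as a number of days
    if isinstance(offset, int):
        offset = {"days": offset}
    # plural keys (e.g. 'days', 'hours') are added to the timestamp
    plural = {k: v for k, v in offset.items() if k.endswith("s")}
    # singular keys (e.g. 'hour', 'minute') replace that field of the timestamp
    singular = {k: v for k, v in offset.items() if not k.endswith("s")}
    return ts.replace(**singular) + timedelta(**plural)


ts = datetime(2024, 1, 31, 10, 30)
added = apply_offset(ts, {"days": 1})     # moves forward one day
replaced = apply_offset(ts, {"hour": 3})  # same day, hour set to 3
```

Here added rolls over into February, while replaced keeps the date and minute and only changes the hour.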

Parameters:

canonical – a pd.DataFrame as the reference dataframe
header – the header in the DataFrame to correlate
choice – (optional) The number of values or percentage between 0 and 1 to choose.
choice_header – (optional) those not chosen are given the values of the given header
offset – (optional) Temporal parameter that adds to or replaces the offset value. if int then assume ‘days’
jitter – (optional) the random jitter or deviation in days
jitter_units – (optional) the units of the jitter, Options: ‘W’, ‘D’, ‘h’, ‘m’, ‘s’. default ‘D’
ignore_time – ignore time elements and only select from Year, Month, Day elements. Default is False
ignore_seconds – ignore second elements and only select from Year to minute elements. Default is False
min_date – (optional) a minimum date not to go below
max_date – (optional) a max date not to go above
now_delta – (optional) returns a delta from now as an int list, Options: ‘Y’, ‘M’, ‘W’, ‘D’, ‘h’, ‘m’, ‘s’
day_first – (optional) if the dates given are in day-first format. Default to True
year_first – (optional) if the dates given are year first. Default to False
date_format – (optional) the format of the output
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, added to a level above any current instance of the intent section, or level 0 if not found. If int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a list of equal size to that given

correlate_discrete_intervals(canonical: ~typing.Any, header: str, granularity: [<class 'int'>, <class 'float'>, <class 'list'>] = None, lower: [<class 'int'>, <class 'float'>] = None, upper: [<class 'int'>, <class 'float'>] = None, categories: list = None, precision: int = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

converts continuous representation into discrete representation through interval categorisation

Parameters:

canonical – a pd.DataFrame as the reference dataframe
header – the header in the DataFrame to correlate
granularity – (optional) the granularity of the analysis across the range. Default is 3. If an int is passed, it represents the number of periods. If a float is passed, it is the length of each interval. If a list[tuple] is passed, it gives specific interval periods e.g []. If a list[float] is passed, it gives the percentiles or quantiles, all of which should fall between 0 and 1
lower – (optional) the lower limit of the number value. Default min()
upper – (optional) the upper limit of the number value. Default max()
precision – (optional) The precision of the range and boundary values. by default set to 5.
categories – (optional) a set of labels the same length as the intervals to name the categories
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, added to a level above any current instance of the intent section, or level 0 if not found. If int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a list of equal length to the one passed
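For the simplest case, an int granularity, interval categorisation amounts to equal-width binning between the lower and upper limits. The sketch below illustrates that case only and is not the component's implementation; discretise is a hypothetical name, and the choice of left-closed intervals with the top value folded into the last interval is an assumption.

```python
def discretise(values, granularity=3, lower=None, upper=None, precision=5):
    # default the limits to the observed min and max (assumes upper > lower)
    lower = min(values) if lower is None else lower
    upper = max(values) if upper is None else upper
    width = (upper - lower) / granularity
    # boundary values of the equal-width intervals
    edges = [round(lower + i * width, precision) for i in range(granularity + 1)]
    result = []
    for v in values:
        # interval position; the upper limit itself falls in the last interval
        idx = min(int((v - lower) / width), granularity - 1)
        result.append((edges[idx], edges[idx + 1]))
    return result


intervals = discretise([0, 2, 5, 10], granularity=2)
```

Each value is replaced by the (lower, upper) bounds of its interval, so the output has the same length as the input. The categories parameter would swap these bounds for labels.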

correlate_join(canonical: ~typing.Any, header: str, action: [<class 'str'>, <class 'dict'>], sep: str = None, rtn_type: str = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

correlate a column and join it with the result of the action. This allows composite values to be built. An example might be to take a forename and add the surname with a space separator to create a composite name field, or to join two primary keys to create a single composite key.

Parameters:

canonical – a direct or generated pd.DataFrame. see context notes below
header – an ordered list of columns to join
action – a string or a single action whose outcome will be joined to the header value
sep – (optional) a separator between the values
seed – (optional) a seed value for the random function: default to None
rtn_type – (optional) changes the default return of a ‘list’ to a pd.Series - other than the int, float, category, string and object, passing ‘as-is’ will return as is
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, added to a level above any current instance of the intent section, or level 0 if not found. If int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a list of equal length to the one passed

Actions are the resulting outcome of the selection (or the default). An action can be just a value or a dict that executes an intent method such as get_number(). To help build actions there is a helper function called action2dict(…) that takes a method as a mandatory attribute.

With actions there are special keyword ‘method’ values:

@header: use a column as the value reference, expects the ‘header’ key
@constant: use a value constant, expects the key ‘value’
@sample: use to get sample values, expected ‘name’ of the Sample method, optional ‘shuffle’ boolean
@eval: evaluate a code string, expects the key ‘code_str’ and any locals() required

An example of a simple action to return a selection from a list:

{'method': 'get_category', 'selection': ['M', 'F', 'U']}

an example of using the helper method, in this example we use the keyword @header to get a value from another column at the same index position:

inst.action2dict(method="@header", header='value')

We can even execute some sort of evaluation at run time:

inst.action2dict(method="@eval", code_str='sum(values)', values=[1,4,2,1])
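The join itself, as described for correlate_join, can be pictured with the sketch below. It is illustrative only: join_with_action is a hypothetical name, a list stands in for the header column, and a second list stands in for the already-resolved outcome of an action such as @header.

```python
def join_with_action(values, action, sep=""):
    # a plain string is treated as a constant joined to every value;
    # otherwise 'action' is taken to be the resolved per-row outcome
    others = [action] * len(values) if isinstance(action, str) else list(action)
    return [f"{v}{sep}{o}" for v, o in zip(values, others)]


forenames = ["Ada", "Alan"]
surnames = ["Lovelace", "Turing"]
full_names = join_with_action(forenames, surnames, sep=" ")
tagged = join_with_action(forenames, "UK", sep="-")
```

The first call builds the composite name field from the forename example, and the second joins the same constant to every row.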

correlate_list_element(canonical: ~typing.Any, header: str, list_size: int = None, random_choice: bool = None, replace: bool = None, shuffle: bool = None, convert_str: bool = None, rtn_type: str = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

correlate a column where each element of the column contains a list, and a choice is taken from that list. If the list_size == 1 then a single value is correlated, otherwise a list is correlated

Null values are passed through but all other elements must be a list with at least 1 value in.

If ‘random_choice’ is true then all returned values will be a random selection from the list and of equal length. If ‘random_choice’ is false then each list will not exceed the ‘list_size’

Also, if ‘random_choice’ is true and ‘replace’ is False then all lists must have more elements than the list_size. By default ‘replace’ is True and ‘shuffle’ is False.

In addition, ‘convert_str’ allows lists that have been formatted as a string to be converted from a string to a list using ‘ast.literal_eval(x)’
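The selection rules above can be sketched for a single cell as follows. This is a simplified illustration rather than the component's code: pick_elements is a hypothetical name, None stands in for a null value, and the ‘shuffle’ option is omitted.

```python
import ast
import random


def pick_elements(cell, list_size=1, random_choice=False, replace=True,
                  convert_str=False, seed=None):
    # null values are passed through untouched
    if cell is None:
        return None
    # a list stored as a string, e.g. "['a', 'b']", is parsed back into a list
    if convert_str and isinstance(cell, str):
        cell = ast.literal_eval(cell)
    rng = random.Random(seed)
    if random_choice:
        # with replacement the same element may repeat; without, the list
        # must hold at least list_size elements
        picked = rng.choices(cell, k=list_size) if replace else rng.sample(cell, k=list_size)
    else:
        # non-random: take from the front, never exceeding list_size
        picked = list(cell[:list_size])
    # a list_size of 1 yields a single value rather than a list
    return picked[0] if list_size == 1 else picked
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for converting stored strings back into lists.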

Parameters:

canonical – a direct or generated pd.DataFrame. see context notes below
header – The header containing a list to choose from.
list_size – (optional) the number of elements to return, if more than 1 then list
random_choice – (optional) if the choice should be a random choice.
replace – (optional) if the choice selection should be replaced or selected only once
shuffle – (optional) if the final list should be shuffled
convert_str – if the header has the list as a string convert to list using ast.literal_eval()
seed – (optional) a seed value for the random function: default to None
rtn_type – (optional) changes the default return of a ‘list’ to a pd.Series other than the int, float, category, string and object, passing ‘as-is’ will return as is
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, added to a level above any current instance of the intent section, or level 0 if not found. If int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a list of equal length to the one passed

correlate_mark_outliers(canonical: ~typing.Any, header: str, measure: [<class 'int'>, <class 'float'>] = None, method: str = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

Marks values in the canonical column that are deemed outliers based on the method and measure. The selectable methods are ‘interquartile’ and ‘empirical’, of which interquartile is the default.

The ‘empirical’ rule states that for a normal distribution, nearly all of the data will fall within three standard deviations of the mean. Given mu and sigma, a simple way to identify outliers is to compute a z-score for every value, which is defined as the number of standard deviations a value is away from the mean. The measure given should therefore be the z-score. The 68–95–99.7 rule gives the percentage of values that lie within a band around the mean in a normal distribution with a width of two, four and six standard deviations respectively, and so guides the choice of z-score.

The ‘interquartile’ range (IQR), also called the midspread, middle 50%, or H‑spread, is a measure of statistical dispersion, being equal to the difference between the 75th and 25th percentiles, or between the upper and lower quartiles of a sample set. The IQR can be used to identify outliers by defining limits on the sample values that are a factor k of the IQR below the 25th percentile or above the 75th percentile. The common value for the factor k is 1.5. A factor k of 3 or more can be used to identify values that are extreme outliers.
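Both rules can be sketched with the standard library. The code below is an illustration under assumptions, not the component's implementation: mark_outliers is a hypothetical name, a plain list stands in for the column, boolean flags are used as the marking, and the sample standard deviation is assumed for the z-score.

```python
from statistics import mean, quantiles, stdev


def mark_outliers(values, method="interquartile", measure=None):
    if method == "interquartile":
        # measure is the factor k applied to the IQR, commonly 1.5
        k = 1.5 if measure is None else measure
        q1, _, q3 = quantiles(values, n=4)
        iqr = q3 - q1
        low, high = q1 - k * iqr, q3 + k * iqr
        return [v < low or v > high for v in values]
    if method == "empirical":
        # measure is the z-score threshold, i.e. standard deviations from the mean
        z = 3 if measure is None else measure
        mu, sigma = mean(values), stdev(values)
        return [abs(v - mu) / sigma > z for v in values]
    raise ValueError(f"unknown method: {method}")


data = [10, 12, 11, 13, 12, 11, 100]
iqr_flags = mark_outliers(data)
z_flags = mark_outliers(data, method="empirical", measure=2)
```

On a sample this small the largest possible z-score is below 3, which is why the empirical call uses a measure of 2. The interquartile rule flags the same value with the default factor of 1.5.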

Parameters:

canonical – a pd.DataFrame as the reference dataframe
header – the header in the DataFrame to correlate
method – (optional) A method to run to identify outliers. interquartile (default) or empirical
measure – (optional) A measure against each method, respectively factor k, z-score, quartile (see above)
seed – (optional) the random seed
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, added to a level above any current instance of the intent section, or level 0 if not found. If int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

an equal length list of correlated values

correlate_missing(canonical: ~typing.Any, header: str, method: str = None, weights: str = None, constant: ~typing.Any = None, precision: int = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

imputes missing data with statistical estimates of the missing values. The methods are ‘mean’, ‘median’, ‘mode’ and ‘random’ with the addition of ‘constant’ and ‘indicator’

Mean/median imputation consists of replacing all occurrences of missing values (NA) within a variable by the mean (if the variable has a Gaussian distribution) or median (if the variable has a skewed distribution). Can only be applied to numeric values.

Mode imputation consists of replacing all occurrences of missing values (NA) within a variable by the mode, which is the most frequent value or most frequent category. Can be applied to both numerical and categorical variables.

Random sampling imputation is in principle similar to mean, median, and mode imputation in that it considers that missing values should look like those already existing in the distribution. Random sampling consists of taking random observations from the pool of available data and using them to replace the NA. In random sample imputation, we take as many random observations as there are missing values in the variable. Can be applied to both numerical and categorical variables.

Neighbour imputation is for filling in missing values using the k-Nearest Neighbors approach. Each missing feature is imputed using values from five nearest neighbors that have a value for the feature. The feature of the neighbors are averaged uniformly or weighted by distance to each neighbor. If a sample has more than one feature missing, then the neighbors for that sample can be different depending on the particular feature being imputed. When the number of available neighbors is less than five the average for that feature is used during imputation. If there is at least one neighbor with a defined distance, the weighted or unweighted average of the remaining neighbors will be used during imputation.

Constant or Arbitrary value imputation consists of replacing all occurrences of missing values (NA) with an arbitrary constant value. Can be applied to both numerical and categorical variables. A value must be passed in the constant parameter relevant to the column type.

Indicator is not an imputation method. However, imputation techniques such as mean, median and random can affect the variable distribution quite dramatically, so it is a good idea to flag imputed values with a missing indicator. This must be done before imputation of the column.
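The simpler imputation methods described above can be sketched with the standard library. This is an illustration only, not the component's implementation: impute is a hypothetical name, None stands in for NA, a plain list stands in for the column, and neighbour imputation is not covered.

```python
import random
from statistics import mean, median, mode


def impute(values, method="mean", constant=None, precision=3, seed=None):
    observed = [v for v in values if v is not None]
    if method == "indicator":
        # flag the missing positions; run this before imputing the column
        return [int(v is None) for v in values]
    if method == "random":
        # draw one random observation for each missing value
        rng = random.Random(seed)
        return [rng.choice(observed) if v is None else v for v in values]
    if method == "constant":
        fill = constant
    elif method == "mode":
        fill = mode(observed)
    elif method in ("mean", "median"):
        fill = round({"mean": mean, "median": median}[method](observed), precision)
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in values]


flags = impute([1, None, 3], method="indicator")
filled = impute([1, None, 3], method="mean")
```

Taking the indicator first, as in the last two lines, preserves the record of which rows were missing once the column has been filled.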

Parameters:

canonical – a pd.DataFrame as the reference dataframe
header – the header in the DataFrame to correlate
method – (optional) the replacement method, ‘mean’, ‘median’, ‘mode’, ‘constant’, ‘random’, ‘indicator’
weights – (optional) Weight function used in prediction of nearest neighbour if used as method. Options ‘uniform’ : uniform weights. All points in each neighborhood are weighted equally. ‘distance’ : weight points by the inverse of their distance.
constant – (optional) a value to use when the method is constant
precision – (optional) if numeric, the precision of the outcome, by default set to 3.
seed – (optional) the random seed. defaults to current datetime
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, added to a level above any current instance of the intent section, or level 0 if not found. If int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

an equal length list of correlated values
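
The ordering described above (flag first, then impute) can be illustrated with a minimal sketch. It uses plain Python lists with None standing in for missing values; add_missing_indicator and impute are hypothetical helpers written for this illustration and are not part of the library.

```python
from statistics import mean, median


def add_missing_indicator(values):
    """Return 1 where a value is missing (None) and 0 elsewhere."""
    return [1 if v is None else 0 for v in values]


def impute(values, method="mean", constant=None, precision=3):
    """Fill None entries with the mean, the median or a constant (hypothetical helper)."""
    present = [v for v in values if v is not None]
    if method == "mean":
        fill = round(mean(present), precision)
    elif method == "median":
        fill = round(median(present), precision)
    elif method == "constant":
        fill = constant
    else:
        raise ValueError(f"unsupported method: {method}")
    return [fill if v is None else v for v in values]


ages = [21, None, 35, 40, None]
flags = add_missing_indicator(ages)   # flag first, while the gaps are still visible
filled = impute(ages, method="mean")  # then fill the gaps
```

Running the indicator after imputation would produce a column of zeros, which is why the flag has to be created first.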

correlate_numbers(canonical: ~typing.Any, header: str, standardize: bool = None, normalize: bool = None, scalar: tuple = None, transform: str = None, precision: int = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

Applies a scaling transformation to a continuous value set. These techniques alter the values of a variable so that they are expressed on a common scale. This is often done to make it easier to compare different variables or to analyze the data.

Parameters:

canonical – a pd.DataFrame as the reference dataframe
header – the header in the DataFrame to correlate
standardize – (optional) standardise continuous variables with mean 0 and std 1
normalize – (optional) normalize continuous variables between 0 and 1.
scalar – (optional) scales continuous variables between a min and max value passed in the tuple pair.
transform – (optional) attempts normal distribution of continuous variables. options are log, sqrt, cbrt, boxcox, yeojohnson
precision – (optional) how many decimal places. default to 3
seed – (optional) the random seed. defaults to current datetime
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

an equal length list of correlated values
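
As a rough illustration of the standardize, normalize and scalar options, the sketch below implements the three textbook formulas in plain Python. The function names are hypothetical, and the use of the population standard deviation is an assumption; the library's own implementation may differ.

```python
from statistics import mean, pstdev


def scale_between(values, scalar, precision=3):
    """Min-max scale values into the (low, high) range given by the tuple."""
    low, high = scalar
    v_min, v_max = min(values), max(values)
    return [round(low + (v - v_min) * (high - low) / (v_max - v_min), precision)
            for v in values]


def normalize(values, precision=3):
    """Min-max scale values into the range 0 to 1."""
    return scale_between(values, (0, 1), precision)


def standardize(values, precision=3):
    """Rescale to mean 0 and standard deviation 1 (population std assumed)."""
    mu, sigma = mean(values), pstdev(values)
    return [round((v - mu) / sigma, precision) for v in values]


values = [2, 4, 6, 8]
print(normalize(values))                # [0.0, 0.333, 0.667, 1.0]
print(scale_between(values, (10, 40)))  # [10.0, 20.0, 30.0, 40.0]
print(standardize(values))              # [-1.342, -0.447, 0.447, 1.342]
```

All three helpers divide by the spread of the data, so a constant column would need separate handling.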

The offset can be a numeric offset that is added to the value, e.g. passing 2 will add 2 to all values. If a string is passed, its format should be a calculation with the ‘@’ character used to represent the column value, e.g.

'1-@' would subtract the column value from 1,
'@*0.5' would multiply the column value by 0.5
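
A minimal sketch of how such an offset could be interpreted is shown here. apply_offset is a hypothetical helper, and evaluating the string with eval is an assumption made for illustration, not a description of the library's internals.

```python
def apply_offset(values, offset):
    """Add a numeric offset, or evaluate a string expression where '@' is the value."""
    if isinstance(offset, str):
        # Substitute the placeholder with a variable name and evaluate per element.
        expression = offset.replace("@", "(x)")
        return [eval(expression, {"__builtins__": {}}, {"x": v}) for v in values]
    return [v + offset for v in values]


print(apply_offset([1, 2, 3], 2))       # [3, 4, 5]
print(apply_offset([0.25, 1], "1-@"))   # [0.75, 0]
print(apply_offset([2, 4], "@*0.5"))    # [1.0, 2.0]
```

Because the string is evaluated as code, it should only ever come from a trusted source.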

correlate_polynomial(canonical: ~typing.Any, header: str, coefficient: list, rtn_type: str = None, seed: int = None, keep_zero: bool = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

creates a polynomial using the reference header values and applies the coefficients, where the index of the list represents the degree of the term, i.e. the coefficients are listed in reverse order.

e.g. [6, -2, 0, 4] => f(x) = 4x**3 - 2x + 6
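
The coefficient ordering can be checked with a short sketch. The polynomial function below is a hypothetical stand-alone helper that evaluates the list with Horner's method; it only illustrates the index-equals-degree convention.

```python
def polynomial(x, coefficient):
    """Evaluate sum(coefficient[i] * x**i) using Horner's method."""
    result = 0
    # Start from the highest-degree term, which is the last item in the list.
    for c in reversed(coefficient):
        result = result * x + c
    return result


coefficient = [6, -2, 0, 4]          # f(x) = 4x**3 - 2x + 6
print(polynomial(2, coefficient))    # 34
print([polynomial(x, coefficient) for x in (0, 1, -1)])  # [6, 8, 4]
```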

Parameters:

canonical – a direct or generated pd.DataFrame. see context notes below
header – the header in the DataFrame to correlate
coefficient – the reverse list of term coefficients
seed – (optional) the random seed. defaults to current datetime
rtn_type – (optional) returns a pd.Series of the given type instead of the default ‘list’. Accepts int, float, category, string or object; passing ‘as-is’ returns the values as they are
keep_zero – (optional) if True then zeros passed remain zero, Default is False
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

an equal length list of correlated values

correlate_selection(canonical: ~typing.Any, selection: list, action: [<class 'str'>, <class 'int'>, <class 'float'>, <class 'dict'>], default_action: [<class 'str'>, <class 'int'>, <class 'float'>, <class 'dict'>] = None, seed: int = None, rtn_type: str = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

returns a value set based on the selection list and the action enacted on that selection. If the selection criteria are not fulfilled then the default_action is taken if specified, else a null value is returned.

If a DataFrame is not passed, the values column is referenced by the header ‘_default’

Parameters:

canonical – a direct or generated pd.DataFrame. see context notes below
selection – a list of selections where conditions are filtered on, executed in list order

An example of a selection with the minimum requirements is: (see ‘select2dict(…)’)

[{'column': 'genre', 'condition': "=='Comedy'"}]

Parameters:

action – a value or dict to act upon if the select is successful. see below for more examples

An example of an action as a dict: (see ‘action2dict(…)’)

{'method': 'get_category', 'selection': ['M', 'F', 'U']}

Parameters:

default_action – (optional) a default action to take if the selection is not fulfilled
seed – (optional) a seed value for the random function: default to None
rtn_type – (optional) returns a pd.Series of the given type instead of the default ‘list’. Accepts int, float, category, string or object; passing ‘as-is’ returns the values as they are
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

value set based on the selection list and the action

Selections are a list of dictionaries of conditions and optional additional parameters to filter. To help build conditions there is a static helper method called ‘select2dict(…)’ that has parameter options available to build a condition. An example of a condition with the minimum requirements is [{'column': 'genre', 'condition': "=='Comedy'"}]

An example of using the helper method:

selection = [inst.select2dict(column='gender', condition="=='M'"),
             inst.select2dict(column='age', condition=">65", logic='XOR')]

Using the ‘select2dict’ method ensures the correct keys are used and the dictionary is properly formed. It also helps with building the logic that is executed in order.
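
To make the "executed in order" behaviour concrete, here is a minimal sketch that applies a selection list to rows held as plain dictionaries. selection_mask is a hypothetical helper. It assumes each condition is a Python comparison fragment applied to the column value, that ‘logic’ combines the condition with the running result, and that the default logic is ‘AND’. None of this is confirmed as the library's actual implementation.

```python
from typing import Dict, List


def _holds(value, condition: str) -> bool:
    # The condition is a comparison fragment such as "=='M'" or ">65".
    return bool(eval(f"value {condition}", {"__builtins__": {}}, {"value": value}))


def selection_mask(rows: List[Dict], selection: List[Dict]) -> List[bool]:
    """Return one boolean per row, combining the conditions in list order."""
    mask = None
    for item in selection:
        current = [_holds(row[item["column"]], item["condition"]) for row in rows]
        logic = item.get("logic", "AND")  # assumed default
        if mask is None:
            mask = current
        elif logic == "AND":
            mask = [a and b for a, b in zip(mask, current)]
        elif logic == "OR":
            mask = [a or b for a, b in zip(mask, current)]
        elif logic == "XOR":
            mask = [a != b for a, b in zip(mask, current)]
        else:
            raise ValueError(f"unsupported logic: {logic}")
    return mask if mask is not None else [True] * len(rows)


rows = [{"gender": "M", "age": 70}, {"gender": "M", "age": 30},
        {"gender": "F", "age": 80}, {"gender": "F", "age": 20}]
selection = [{"column": "gender", "condition": "=='M'"},
             {"column": "age", "condition": ">65", "logic": "XOR"}]
print(selection_mask(rows, selection))  # [False, True, True, False]
```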

Actions are the resulting outcome of the selection (or the default). An action can be just a value or a dict that executes an intent method such as get_number(). To help build actions there is a helper function called action2dict(…) that takes a method as a mandatory attribute.

With actions there are special keyword ‘method’ values:

@header: use a column as the value reference, expects the ‘header’ key
@constant: use a value constant, expects the key ‘value’
@sample: use to get sample values, expected ‘name’ of the Sample method, optional ‘shuffle’ boolean
@eval: evaluate a code string, expects the key ‘code_str’ and any locals() required

An example of a simple action to return a selection from a list:

{'method': 'get_category', 'selection': ['M', 'F', 'U']}

This same action using the helper method would look like:

inst.action2dict(method='get_category', selection=['M', 'F', 'U'])

An example of using the helper method with the keyword @header to get a value from another column at the same index position:

inst.action2dict(method="@header", header='value')

We can even execute some sort of evaluation at run time:

inst.action2dict(method="@eval", code_str='sum(values)', values=[1,4,2,1])
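
The shape of an action dictionary and the meaning of three of the special keywords can be sketched as follows. action_to_dict and resolve_action are hypothetical helpers written for this illustration. They assume a flat dictionary with a ‘method’ key, as in the examples above, and do not reproduce the library's own code.

```python
def action_to_dict(method, **kwargs):
    """Build a flat action dictionary with 'method' plus its parameters."""
    return {"method": method, **kwargs}


def resolve_action(action, row):
    """Resolve an action against one row (a dict of column name to value)."""
    if not isinstance(action, dict):
        return action  # a plain value is returned unchanged
    params = dict(action)
    method = params.pop("method")
    if method == "@header":
        return row[params["header"]]
    if method == "@constant":
        return params["value"]
    if method == "@eval":
        code_str = params.pop("code_str")
        # The remaining parameters act as the local names for the expression.
        return eval(code_str, {}, params)
    raise NotImplementedError(f"'{method}' is outside the scope of this sketch")


row = {"value": 9}
print(resolve_action(action_to_dict("@header", header="value"), row))   # 9
print(resolve_action(action_to_dict("@eval", code_str="sum(values)",
                                    values=[1, 4, 2, 1]), row))         # 8
```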

correlate_values(canonical: ~typing.Any, header: str, choice: [<class 'int'>, <class 'float'>, <class 'str'>] = None, choice_header: str = None, precision: int = None, jitter: [<class 'int'>, <class 'float'>, <class 'str'>] = None, offset: [<class 'int'>, <class 'float'>, <class 'str'>] = None, transform: ~typing.Any = None, lower: [<class 'int'>, <class 'float'>] = None, upper: [<class 'int'>, <class 'float'>] = None, keep_zero: bool = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → list

correlate a list of continuous values adjusting those values, or a subset of those values, with a normalised jitter (std from the value) along with a value offset. choice, jitter and offset can accept environment variable string names starting with ${ and ending with }.

Parameters:

canonical – a pd.DataFrame as the reference dataframe
header – the header in the DataFrame to correlate
choice – (optional) The number of values to choose to apply the change to. Can be an environment variable.
choice_header – (optional) those not chosen are given the values of the given header
precision – (optional) the number of decimal places in the returned values. defaults to 3
offset – (optional) a fixed value to offset or if str an operation to perform using @ as the header value.
transform – (optional) passing a lambda function to transform the value. e.g. lambda x: (x - 3) / 2
jitter – (optional) a perturbation of the value where the jitter is a random normally distributed std
seed – (optional) the random seed. defaults to current datetime
keep_zero – (optional) if True then zeros passed remain zero despite a change, Default is False
lower – a minimum value not to go below
upper – a max value not to go above
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

an equal length list of correlated values

frame_selection(canonical: ~typing.Any, selection: list = None, choice: int = None, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → DataFrame

Selects rows and/or columns, changing the shape of the DataFrame. This is always run last in a pipeline. Rows are filtered before the column filter, so columns can be referenced even though they might not be included in the final column list.

Parameters:

canonical – a direct or generated pd.DataFrame. see context notes below
selection – a list of selections where conditions are filtered on, executed in list order

An example of a selection with the minimum requirements is: (see ‘select2dict(…)’)

[{'column': 'genre', 'condition': "=='Comedy'"}]

Parameters:

choice – a number of rows to select, randomly selected from the index
headers – a list of headers to drop or filter on type
drop – to drop or not drop the headers
dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’
exclude – to exclude or include the dtypes
regex – a regular expression to search the headers. example ‘^((?!_amt).)*$’ excludes ‘_amt’ columns
re_ignore_case – true if the regex should ignore case. Default is False
seed – this is a place holder, here for compatibility across methods
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pd.DataFrame
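
The regex and re_ignore_case parameters can be tried out on a plain list of headers. The sketch below uses only the standard re module; filter_headers is a hypothetical helper, and the example pattern is the negative-lookahead expression that excludes any header containing ‘_amt’.

```python
import re


def filter_headers(headers, regex, re_ignore_case=False):
    """Keep the headers that match the regular expression."""
    flags = re.IGNORECASE if re_ignore_case else 0
    pattern = re.compile(regex, flags)
    return [h for h in headers if pattern.search(h)]


headers = ["id", "sale_amt", "tax_AMT", "name"]
print(filter_headers(headers, r"^((?!_amt).)*$"))
# ['id', 'tax_AMT', 'name']
print(filter_headers(headers, r"^((?!_amt).)*$", re_ignore_case=True))
# ['id', 'name']
```

The case-sensitive run keeps ‘tax_AMT’ because the lookahead only rejects the lower-case ‘_amt’.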

Selections are a list of dictionaries of conditions and optional additional parameters to filter. To help build conditions there is a static helper method called ‘select2dict(…)’ that has parameter options available to build a condition. An example of a condition with the minimum requirements is [{'column': 'genre', 'condition': "=='Comedy'"}]

An example of using the helper method:

selection = [inst.select2dict(column='gender', condition="=='M'"),
             inst.select2dict(column='age', condition=">65", logic='XOR')]

Using the ‘select2dict’ method ensures the correct keys are used and the dictionary is properly formed. It also helps with building the logic that is executed in order.

frame_starter(canonical: ~typing.Any, selection: list = None, choice: int = None, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, rename_map: dict = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → DataFrame

Selects rows and/or columns, changing the shape of the DataFrame. This is always run first in a pipeline. Rows are filtered before columns are filtered, so columns can be referenced even though they might not be included in the final column list.

Parameters:

canonical – a direct or generated pd.DataFrame. see context notes below
choice – a number of rows to select, randomly selected from the index
selection – a list of selections where conditions are filtered on, executed in list order

An example of a selection with the minimum requirements is: (see ‘select2dict(…)’)

[{'column': 'genre', 'condition': "=='Comedy'"}]

Parameters:

headers – a list of headers to drop or filter on type
drop – to drop or not drop the headers
dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’
exclude – to exclude or include the dtypes
regex – a regular expression to search the headers. example ‘^((?!_amt).)*$’ excludes ‘_amt’ columns
re_ignore_case – true if the regex should ignore case. Default is False
seed – this is a place holder, here for compatibility across methods
rename_map – a from: to dictionary of headers to rename
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pd.DataFrame

The canonical is a pd.DataFrame, a pd.Series or list, a connector contract str reference, an int, or a dict of parameter instructions on how to generate a pd.DataFrame. The description of each is:

pd.DataFrame -> a deep copy of the pd.DataFrame
pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given
str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection
int -> generates an empty pd.DataFrame with an index size of the int passed.
dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame

- model_*(...) -> one of the SyntheticBuilder model methods and parameters

- @empty -> generates an empty pd.DataFrame where size and headers can be passed:
  :size sets the index size of the dataframe
  :headers any initial headers for the dataframe

- @generate -> generate a synthetic file from a remote Domain Contract:
  :task_name the name of the SyntheticBuilder task to run
  :repo_uri the location of the Domain Product
  :size (optional) a size to generate
  :seed (optional) if a seed should be applied
  :run_book (optional) if specific intent should be run only

Selections are a list of dictionaries of conditions and optional additional parameters to filter. To help build conditions there is a static helper method called ‘select2dict(…)’ that has parameter options available to build a condition. An example of a condition with the minimum requirements is [{'column': 'genre', 'condition': "=='Comedy'"}]

An example of using the helper method:

selection = [inst.select2dict(column='gender', condition="=='M'"),
             inst.select2dict(column='age', condition=">65", logic='XOR')]

Using the ‘select2dict’ method ensures the correct keys are used and the dictionary is properly formed. It also helps with building the logic that is executed in order.

get_category(selection: list, relative_freq: list = None, quantity: float = None, size: int = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → list

returns a category from a list. Of particular note is the at_least parameter, which allows you to control the number of times a selection can be chosen.

Parameters:

selection – a list of items to select from
relative_freq – a weighting pattern that does not have to add to 1
quantity – a number between 0 and 1 representing the percentage quantity of the data
size – an optional size of the return. default to 1
seed – a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

an item or list of items chosen from the list
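
The interplay of relative_freq, quantity, size and seed can be sketched with the standard random module. pick_category is a hypothetical helper. The treatment of quantity as the probability that a value is kept rather than set to null is an assumption based on the parameter descriptions.

```python
import random


def pick_category(selection, relative_freq=None, quantity=1.0, size=1, seed=None):
    """Choose `size` items, weighted by relative_freq, nulling a share of them."""
    rng = random.Random(seed)
    # random.choices accepts weights that do not sum to 1.
    picks = rng.choices(selection, weights=relative_freq, k=size)
    # Keep each pick with probability `quantity`, otherwise return None.
    return [p if rng.random() < quantity else None for p in picks]


sample = pick_category(["M", "F", "U"], relative_freq=[5, 4, 1], size=10, seed=31)
print(sample)
```

Passing the same seed reproduces the same list, which is useful when a synthetic column has to be regenerated consistently.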

get_datetime(start: ~typing.Any, until: ~typing.Any, relative_freq: list = None, at_most: int = None, ordered: str = None, date_format: str = None, as_num: bool = None, ignore_time: bool = None, ignore_seconds: bool = None, size: int = None, quantity: float = None, seed: int = None, day_first: bool = None, year_first: bool = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → list

returns a random date between two date and/or times. weighted patterns can be applied to the overall date range. if a signed ‘int’ type is passed to the start and/or until dates, the inferred date will be the current date time with the integer being the offset from the current date time in ‘days’.

Note: If no patterns are set this will return a uniformly distributed random value between the range boundaries.

Parameters:

start – the start boundary of the date range can be str, datetime, pd.datetime, pd.Timestamp or int
until – up until boundary of the date range can be str, datetime, pd.datetime, pd.Timestamp or int
quantity – the quantity of values that are not null. Number between 0 and 1
relative_freq – (optional) A pattern across the whole date range.
at_most – the most times a selection should be chosen
ordered – order the data ascending or descending. Accepted values are ‘asc’ or ‘des’
ignore_time – ignore time elements and only select from Year, Month, Day elements. Default is False
ignore_seconds – ignore second elements and only select from Year to minute elements. Default is False
date_format – the string format of the date to be returned. if not set then pd.Timestamp returned
as_num – returns a list of Matplotlib date values as a float. Default is False
size – the size of the sample to return. Default to 1
seed – a seed value for the random function: default to None
year_first – specifies if to parse with the year first - If True parses dates with the year first, e.g. 10/11/12 is parsed as 2010-11-12. - If both day_first and year_first are True, year_first takes precedence (same as dateutil).
day_first – specifies if to parse with the day first - If True, parses dates with the day first, eg %d-%m-%Y. - If False default to a preferred preference, normally %m-%d-%Y (but not strict)
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a date, or a list of ‘size’ dates, in the format given.
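
The integer-offset behaviour for start and until can be sketched with the standard library. resolve_boundary and random_datetime are hypothetical helpers; the sketch handles only int and datetime boundaries (no string parsing) and draws uniformly, matching the case where no pattern is set.

```python
import random
from datetime import datetime, timedelta


def resolve_boundary(value, now):
    """Treat a signed int as an offset in days from the current date time."""
    if isinstance(value, int):
        return now + timedelta(days=value)
    return value


def random_datetime(start, until, seed=None, now=None):
    """Return one uniformly random datetime between the two boundaries."""
    now = now or datetime.now()
    lower = resolve_boundary(start, now)
    upper = resolve_boundary(until, now)
    rng = random.Random(seed)
    span = (upper - lower).total_seconds()
    return lower + timedelta(seconds=rng.uniform(0, span))


now = datetime(2024, 1, 15, 12, 0)
last_week = random_datetime(-7, 0, seed=3, now=now)  # somewhere in the previous 7 days
```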

get_dist_bernoulli(probability: float, size: int = None, quantity: float = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → list

A Bernoulli discrete random distribution using scipy

Parameters:

probability – the probability of occurrence
size – the size of the sample
quantity – a number between 0 and 1 representing data that isn’t null
seed – a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a random number
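
For readers unfamiliar with the distribution, a Bernoulli draw returns 1 with the given probability and 0 otherwise. The sketch below uses the standard random module rather than scipy, so it illustrates the concept only; bernoulli_sample is a hypothetical name.

```python
import random


def bernoulli_sample(probability, size=1, seed=None):
    """Return `size` draws that are 1 with the given probability, else 0."""
    rng = random.Random(seed)
    return [1 if rng.random() < probability else 0 for _ in range(size)]


flags = bernoulli_sample(0.3, size=1000, seed=42)
print(sum(flags) / len(flags))  # close to 0.3 for a large sample
```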

get_dist_bounded_normal(mean: float, std: float, lower: float, upper: float, precision: int = None, size: int = None, quantity: float = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → list

A bounded normal continuous random distribution.

Parameters:

mean – the mean of the distribution
std – the standard deviation
lower – the lower limit of the distribution
upper – the upper limit of the distribution
precision – the precision of the returned number. if None then assumes int value else float
size – the size of the sample
quantity – a number between 0 and 1 representing data that isn’t null
seed – a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a random number
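
One simple way to obtain a bounded normal sample is rejection sampling: draw from the normal distribution and discard anything outside the limits. The sketch below takes that approach with the standard random module. It is an assumption chosen for illustration, and the library may well use a different technique such as a truncated normal.

```python
import random


def bounded_normal(mean, std, lower, upper, precision=3, size=1, seed=None):
    """Draw normal values, keeping only those inside [lower, upper]."""
    rng = random.Random(seed)
    sample = []
    while len(sample) < size:
        value = rng.gauss(mean, std)
        if lower <= value <= upper:
            sample.append(round(value, precision))
    return sample


heights = bounded_normal(mean=50, std=10, lower=40, upper=60, size=100, seed=7)
```

Rejection sampling becomes slow when the bounds sit far from the mean, since most draws are then discarded.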

get_dist_choice(number: [<class 'int'>, <class 'str'>, <class 'float'>], size: int = None, quantity: float = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → list

Creates a list of latent values of 0 or 1 where 1 is randomly selected based upon the number given. The number parameter can be a direct reference to the canonical column header or to an environment variable. If the environment variable is used, number should be set to "${<<YOUR_ENVIRON>>}" where <<YOUR_ENVIRON>> is the environment variable name

Parameters:

number – The number of true (1) values to randomly choose from the canonical. see below
size – the size of the sample. if a tuple of intervals, size must match the tuple
quantity – a number between 0 and 1 representing data that isn’t null
seed – a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a list of 1 or 0

as choice is a fixed value, number can be represented by an environment variable with the format ‘${NAME}’ where NAME is the environment variable name
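
The ‘${NAME}’ convention can be illustrated with a small sketch that resolves the number from the environment and then marks that many positions with 1. resolve_number and dist_choice are hypothetical helpers, N_ONES is an example variable name, and the sketch covers only integer counts.

```python
import os
import random


def resolve_number(number):
    """Read the count from an environment variable if given as '${NAME}'."""
    if isinstance(number, str) and number.startswith("${") and number.endswith("}"):
        return int(os.environ[number[2:-1]])
    return int(number)


def dist_choice(number, size, seed=None):
    """Return a list of 0s with exactly `number` randomly placed 1s."""
    rng = random.Random(seed)
    ones = set(rng.sample(range(size), k=resolve_number(number)))
    return [1 if i in ones else 0 for i in range(size)]


os.environ["N_ONES"] = "3"  # example variable, set here only for the demonstration
chosen = dist_choice("${N_ONES}", size=10, seed=0)
```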

get_dist_normal(mean: float, std: float, precision: int = None, size: int = None, quantity: float = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → list

A normal (Gaussian) continuous random distribution.

Parameters:

mean – The mean (“centre”) of the distribution.
std – The standard deviation (jitter or “width”) of the distribution. Must be >= 0
precision – The number of decimal points. The default is 3
size – the size of the sample. if a tuple of intervals, size must match the tuple
quantity – a number between 0 and 1 representing data that isn’t null
seed – a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a random number

get_distribution(distribution: str, is_stats: bool = None, precision: int = None, size: int = None, quantity: float = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs) → list

returns a number based on the distribution type.

Parameters:

distribution – The string name of the distribution function from numpy random Generator class
is_stats – (optional) if the generator is from the stats package and not numpy
precision – (optional) the precision of the returned number
size – (optional) the size of the sample
quantity – (optional) a number between 0 and 1 representing data that isn’t null
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
kwargs – the parameters of the method

Returns:

a random number

get_intervals(intervals: list, relative_freq: list = None, precision: int = None, size: int = None, quantity: float = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → list

returns a number based on a list selection of tuple(lower, upper) interval

Parameters:

intervals – a list of unique tuple pairs representing the interval lower and upper boundaries
relative_freq – a weighting pattern or probability that does not have to add to 1
precision – the precision of the returned number. if None then assumes int value else float
size – the size of the sample
quantity – a number between 0 and 1 representing data that isn’t null
seed – a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a random number
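
The interval selection described above can be sketched with the standard library. This is an illustration only, not the ds_discovery implementation: the function name `sample_intervals` is hypothetical, and treating both interval boundaries as inclusive is an assumption.

```python
import random

def sample_intervals(intervals, relative_freq=None, precision=None, size=1, seed=None):
    """Pick an interval by weight, then draw a number inside it (illustrative only)."""
    rng = random.Random(seed)
    # random.choices normalises the weights, so they do not have to add to 1
    chosen = rng.choices(intervals, weights=relative_freq, k=size)
    result = []
    for lower, upper in chosen:
        if precision is None:
            # no precision given: return an integer within the interval
            result.append(rng.randint(int(lower), int(upper)))
        else:
            result.append(round(rng.uniform(lower, upper), precision))
    return result

values = sample_intervals([(0, 10), (10, 20), (50, 60)], relative_freq=[1, 1, 8], size=5, seed=42)
print(values)
```

With the weights `[1, 1, 8]` the last interval is expected to be chosen about eight times in ten over a large sample.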

get_number(from_value: [<class 'int'>, <class 'float'>, <class 'str'>] = None, to_value: [<class 'int'>, <class 'float'>, <class 'str'>] = None, relative_freq: list = None, precision: int = None, ordered: str = None, at_most: int = None, size: int = None, quantity: float = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → list

returns a number in the range from_value to to_value. If only to_value is given, from_value defaults to zero

Parameters:

from_value – (signed) integer or float to start from. See below for str
to_value – (optional) (signed) integer or float the number sequence goes up to but does not include. See below
relative_freq – a weighting pattern or probability that does not have to add to 1
precision – the precision of the returned number. if None then assumes int value else float
ordered – order the data ascending or descending. Accepted values are ‘asc’ or ‘des’
at_most – the most times a selection should be chosen
size – the size of the sample
quantity – a number between 0 and 1 representing data that isn’t null
seed – a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a random number

The values can be represented by an environment variable with the format ‘${NAME}’ where NAME is the environment variable name
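
A minimal sketch of how a ‘${NAME}’ value might be resolved before a number is drawn. The helper names `resolve_value` and `random_number`, and the environment variable `MAX_AGE`, are hypothetical; the library's own resolution logic may differ.

```python
import os
import random
import re

def resolve_value(value):
    """Replace a '${NAME}' string with the number held in that environment variable."""
    if isinstance(value, str):
        match = re.fullmatch(r"\$\{(\w+)\}", value.strip())
        text = os.environ[match.group(1)] if match else value
        return float(text) if "." in text else int(text)
    return value

def random_number(from_value=None, to_value=None, precision=None, size=1, seed=None):
    # if only one bound is supplied, treat it as to_value and start from zero
    if to_value is None:
        from_value, to_value = 0, from_value
    elif from_value is None:
        from_value = 0
    low, high = resolve_value(from_value), resolve_value(to_value)
    rng = random.Random(seed)
    if precision is None:
        # integers: randrange excludes the upper bound
        return [rng.randrange(int(low), int(high)) for _ in range(size)]
    return [round(rng.uniform(low, high), precision) for _ in range(size)]

os.environ["MAX_AGE"] = "90"  # hypothetical variable, set here only for the example
ages = random_number(18, "${MAX_AGE}", size=5, seed=0)
print(ages)
```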

get_sample(sample_name: str, sample_size: int = None, shuffle: bool = None, size: int = None, quantity: float = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

returns a sample set based on sector and name. To see the sample sets available, use the Sample class __dir__() method:

> from ds_discovery.sample.sample_data import Sample
> Sample().__dir__()

Parameters:

sample_name – The name of the Sample method to be used.
sample_size – (optional) the size of the sample to take from the reference file
shuffle – (optional) if the selection should be shuffled before selection. Default is true
quantity – (optional) a number between 0 and 1 representing the percentage quantity of the data
size – (optional) size of the return. default to 1
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a sample list

get_selection(select_source: str, column_header: str, relative_freq: list = None, sample_size: int = None, selection_size: int = None, size: int = None, shuffle: bool = None, quantity: float = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → list

returns a random list of values where the selection of those values is taken from a connector source.

Parameters:

select_source – the selection source for the reference dataframe
column_header – the name of the column header to correlate
relative_freq – (optional) a weighting pattern of the final selection
selection_size – (optional) the selection to take from the sample size, normally used with shuffle
sample_size – (optional) the size of the sample to take from the reference file
shuffle – (optional) if the selection should be shuffled before selection. Default is true
quantity – (optional) a number between 0 and 1 representing the percentage quantity of the data
size – (optional) size of the return. default to 1
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

list

The select_source is normally a connector contract str reference or a set of parameter instructions on how to generate a pd.DataFrame, but can be a pd.DataFrame. The description of each is:

pd.DataFrame -> a deep copy of the pd.DataFrame
pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given
str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection
int -> generates an empty pd.DataFrame with an index size of the int passed.
dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame
methods:
model_*(…) -> one of the SyntheticBuilder model methods and parameters

@empty -> generates an empty pd.DataFrame where size and headers can be passed
:size sets the index size of the dataframe
:headers any initial headers for the dataframe

@generate -> generate a synthetic file from a remote Domain Contract
:task_name the name of the SyntheticBuilder task to run
:repo_uri the location of the Domain Product
:size (optional) a size to generate
:seed (optional) if a seed should be applied
:run_book (optional) if specific intent should be run only

get_string_pattern(pattern: str, choices: dict = None, as_binary: bool = None, quantity: [<class 'float'>, <class 'int'>] = None, size: int = None, choice_only: bool = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → list

Returns a random string based on the pattern given. The pattern is made up from the choices passed but by default is as follows:

c = random char [a-z][A-Z]

d = digit [0-9]

l = lower case char [a-z]

U = upper case char [A-Z]

p = all punctuation

s = space

You can also use punctuation in the pattern, which will be retained. A pattern example might be

UUddsdUU => BA12 2NE or dl-{UU} => 4g-{FY}

to create your own choices pass a dictionary with a reference char key with a list of choices as a value

Parameters:

pattern – the pattern to create the string from
choices – (optional) an optional dictionary of list of choices to replace the default.
as_binary – (optional) if the return string is prefixed with a b
quantity – (optional) a number between 0 and 1 representing the percentage quantity of the data
size – (optional) the size of the return list. if None returns a single value
choice_only – (optional) whether to use only the choices given, or to keep characters not found in the choices as is
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a string based on the pattern
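
The pattern expansion can be sketched as below, using the default keys listed above (c, d, l, U, p, s). The function `string_pattern` is a hypothetical stand-in rather than the library method, and the reading of choice_only (unknown characters are dropped when True, kept when False) is an assumption.

```python
import random
import string

# default reference characters, following the list in the description
DEFAULT_CHOICES = {
    "c": list(string.ascii_letters),
    "d": list(string.digits),
    "l": list(string.ascii_lowercase),
    "U": list(string.ascii_uppercase),
    "p": list(string.punctuation),
    "s": [" "],
}

def string_pattern(pattern, choices=None, choice_only=False, size=1, seed=None):
    choices = DEFAULT_CHOICES if choices is None else choices
    rng = random.Random(seed)
    result = []
    for _ in range(size):
        chars = []
        for key in pattern:
            if key in choices:
                # a reference character: swap it for a random choice
                chars.append(rng.choice(choices[key]))
            elif not choice_only:
                # anything else (such as punctuation) is kept as is
                chars.append(key)
        result.append("".join(chars))
    return result

codes = string_pattern("UUddsdUU", size=3, seed=1)
tagged = string_pattern("dl-{UU}", seed=2)[0]
custom = string_pattern("xx-d", choices={"x": list("AB")}, seed=3)[0]
print(codes, tagged, custom)
```

In the last call the custom dictionary replaces the defaults, so the trailing `d` is no longer a reference character and is retained literally.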

get_tagged_pattern(pattern: [<class 'str'>, <class 'list'>], tags: dict, relative_freq: list = None, size: int = None, quantity: [<class 'float'>, <class 'int'>] = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → list

Returns the pattern with the tags substituted by the tag choice. Example tag dictionary:

{ '<slogan>': {'action': '', 'kwargs': {}},
  '<phone>': {'action': '', 'kwargs': {}}
}

where action is a method name and kwargs are the arguments to pass to it; for sample data, use that method

Parameters:

pattern – a string or list of strings to apply the tag substitution to
tags – a dictionary of tags and actions
relative_freq – a weighting pattern that does not have to add to 1
quantity – a number between 0 and 1 representing the percentage quantity of the data
size – an optional size of the return. default to 1
seed – a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a list of patterns with tags replaced
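
A rough sketch of tag substitution driven by an action name and kwargs. In the library the action is a method name; here a hypothetical `ACTIONS` registry of plain functions (`pick`, `digits`) stands in for those methods, so the names and behaviour below are assumptions for illustration.

```python
import random

# hypothetical stand-ins for the methods an 'action' might name
ACTIONS = {
    "pick": lambda rng, options: rng.choice(options),
    "digits": lambda rng, length: "".join(rng.choice("0123456789") for _ in range(length)),
}

def tagged_pattern(pattern, tags, actions, size=1, seed=None):
    rng = random.Random(seed)
    patterns = [pattern] if isinstance(pattern, str) else list(pattern)
    result = []
    for _ in range(size):
        text = rng.choice(patterns)
        for tag, spec in tags.items():
            if tag in text:
                # run the named action with its kwargs and substitute the result
                value = actions[spec["action"]](rng, **spec.get("kwargs", {}))
                text = text.replace(tag, str(value))
        result.append(text)
    return result

tags = {
    "<slogan>": {"action": "pick", "kwargs": {"options": ["Fast", "Simple"]}},
    "<phone>": {"action": "digits", "kwargs": {"length": 7}},
}
lines = tagged_pattern("<slogan> service, call <phone>", tags, ACTIONS, size=3, seed=7)
print(lines)
```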

get_uuid(version: int = None, as_hex: bool = None, size: int = None, quantity: float = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs) → list

returns a list of UUIDs based on the version presented. By default the UUID version is 4. Optional parameters for the chosen version's UUID generator can be passed as kwargs.

Version 1: Generate a UUID from a host ID, sequence number, and the current time. Note that as uuid1 contains the computer’s network address, it may compromise privacy

param node: (optional) used instead of getnode() which returns a hardware address
param clock_seq: (optional) used as a sequence number alternative

Version 3: Generate a UUID based on the MD5 hash of a namespace identifier and a name

param namespace: an alternative namespace as a UUID e.g. uuid.NAMESPACE_DNS
param name: a string name

Version 4: Generate a random UUID

Version 5: Generate a UUID based on the SHA-1 hash of a namespace identifier and name

param namespace: an alternative namespace as a UUID e.g. uuid.NAMESPACE_DNS
param name: a string name

Parameters:

version – The version of the UUID to use. 1, 3, 4 or 5
as_hex – if the return value is in hex format, else as a string
size – the size of the sample. Must be smaller than the range
quantity – a number between 0 and 1 representing the percentage quantity of the data
seed – a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a unique identifier randomly selected from the range
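
The version options map onto Python's standard `uuid` module, which can be used to see what each version produces. The wrapper `make_uuid` below is a hypothetical helper for illustration, not the library method.

```python
import uuid

def make_uuid(version=4, as_hex=False, **kwargs):
    """Generate one UUID of the given version, passing kwargs to the uuid generator."""
    generators = {1: uuid.uuid1, 3: uuid.uuid3, 4: uuid.uuid4, 5: uuid.uuid5}
    if version not in generators:
        raise ValueError("version must be 1, 3, 4 or 5")
    value = generators[version](**kwargs)
    return value.hex if as_hex else str(value)

random_id = make_uuid()  # version 4, random
named_id = make_uuid(5, namespace=uuid.NAMESPACE_DNS, name="example.org")
hex_id = make_uuid(3, as_hex=True, namespace=uuid.NAMESPACE_DNS, name="example.org")
print(random_id, named_id, hex_id)
```

Versions 3 and 5 are deterministic: the same namespace and name always give the same UUID, whereas versions 1 and 4 differ on every call.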

model_analysis(canonical: ~typing.Any, other: ~typing.Any, columns_list: list = None, exclude_associate: list = None, detail_numeric: bool = None, strict_typing: bool = None, category_limit: int = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → DataFrame

Builds a set of columns based on the ‘other’ dataset (see analyse_association). If a reference DataFrame is passed then, as the analysis is run, if a column already exists its row value is taken as the reference to the sub category rather than a random value. This allows an already constructed association to be used as the reference for a sub category.

Parameters:

canonical – a pd.DataFrame as the reference dataframe
other – a direct or generated pd.DataFrame. see context notes below
columns_list – (optional) a list structure of columns to select for association
exclude_associate – (optional) a list of dot separated tree of items to exclude from iteration (e.g. [‘age.gender.salary’])
detail_numeric – (optional) as a default, if numeric columns should have detail stats, slowing analysis
strict_typing – (optional) stops objects and string types being seen as categories
category_limit – (optional) a global cap on categories captured. zero value returns no limits
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a DataFrame

The other is a pd.DataFrame, a pd.Series, int or list, a connector contract str reference or a set of parameter instructions on how to generate a pd.DataFrame. The description of each is:

pd.DataFrame -> a deep copy of the pd.DataFrame
pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given
str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection
int -> generates an empty pd.DataFrame with an index size of the int passed.
dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame
methods:
model_*(…) -> one of the SyntheticBuilder model methods and parameters

@empty -> generates an empty pd.DataFrame where size and headers can be passed
:size sets the index size of the dataframe
:headers any initial headers for the dataframe

@generate -> generate a synthetic file from a remote Domain Contract
:task_name the name of the SyntheticBuilder task to run
:repo_uri the location of the Domain Product
:size (optional) a size to generate
:seed (optional) if a seed should be applied
:run_book (optional) if specific intent should be run only

model_concat(canonical: ~typing.Any, other: ~typing.Any, as_rows: bool = None, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, shuffle: bool = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → DataFrame

returns the full column values directly from another connector data source.

Parameters:

canonical – a direct or generated pd.DataFrame. see context notes below
other – a direct or generated pd.DataFrame to concatenate
as_rows – (optional) how to concatenate, True adds the connector dataset as rows, False as columns
headers – (optional) a filter of headers from the ‘other’ dataset
drop – (optional) to drop or not drop the headers if specified
dtype – (optional) a filter on data type for the ‘other’ dataset. int, float, bool, object
exclude – (optional) to exclude or include the data types if specified
regex – (optional) a regular expression to search the headers. example ‘^((?!_amt).)*$’ excludes ‘_amt’
re_ignore_case – (optional) true if the regex should ignore case. Default is False
shuffle – (optional) if the rows in the loaded canonical should be shuffled
seed – (optional) this is a placeholder, here for compatibility across methods
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame

The other is a pd.DataFrame, a pd.Series or list, a connector contract str reference or a set of parameter instructions on how to generate a pd.DataFrame. The description of each is:

pd.DataFrame -> a deep copy of the pd.DataFrame
pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given
str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection
dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame
methods:
model_*(…) -> one of the SyntheticBuilder model methods and parameters

@empty -> generates an empty pd.DataFrame where size and headers can be passed
:size sets the index size of the dataframe
:headers any initial headers for the dataframe

@generate -> generate a synthetic file from a remote Domain Contract
:task_name the name of the SyntheticBuilder task to run
:repo_uri the location of the Domain Product
:size (optional) a size to generate
:seed (optional) if a seed should be applied
:run_book (optional) if specific intent should be run only
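
To see how a negative-lookahead regex filters headers, here is a small sketch using the standard `re` module. The helper `filter_headers` and the header names are hypothetical, and the pattern is written in its balanced form `^((?!_amt).)*$`; how the library applies the regex internally is not shown here.

```python
import re

def filter_headers(headers, regex, re_ignore_case=False):
    """Keep only the headers that the regular expression matches."""
    flags = re.IGNORECASE if re_ignore_case else 0
    pattern = re.compile(regex, flags)
    return [h for h in headers if pattern.search(h)]

headers = ["id", "net_amt", "gross_amt", "region", "TAX_AMT"]
kept = filter_headers(headers, r"^((?!_amt).)*$")
kept_ci = filter_headers(headers, r"^((?!_amt).)*$", re_ignore_case=True)
print(kept)     # ['id', 'region', 'TAX_AMT']
print(kept_ci)  # ['id', 'region']
```

The case-sensitive run keeps `TAX_AMT` because the lookahead only rejects the lower-case `_amt`; ignoring case removes it as well.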

model_custom(canonical: ~typing.Any, code_str: str, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs)

Commonly used for custom methods. Takes a code string that, when executed, changes the canonical and returns the modified canonical. If the executed code returns a pd.DataFrame, this is returned; otherwise the assumption is that the canonical has been changed in place, and the modified canonical is returned. When referencing the canonical in the code_str, use either the parameter label ‘canonical’ or the shortcut ‘@’ symbol. kwargs can also be passed into the code string but must be preceded by a ‘$’ symbol

Parameters:

canonical – a direct or generated pd.DataFrame. see context notes below
code_str – an action on those column values
kwargs – a set of kwargs to include in any executable function
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a list or pandas.DataFrame
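
The ‘@’ and ‘$’ conventions can be illustrated with a small stand-in. The function `run_custom` is hypothetical, uses a plain list in place of a pd.DataFrame, and relies on `eval`, so it should only ever be given trusted code strings; the library's own parsing may differ.

```python
import re

def run_custom(canonical, code_str, **kwargs):
    # '@' is shorthand for the canonical
    code = code_str.replace("@", "canonical")
    # '$name' refers to a keyword argument: strip the '$' so it reads as a variable
    code = re.sub(r"\$(\w+)", r"\1", code)
    namespace = {"canonical": canonical, **kwargs}
    result = eval(code, namespace)
    # no return value means the canonical was changed in place
    return canonical if result is None else result

rows = [{"age": 40}, {"age": 20}, {"age": 30}]
adults = run_custom(rows, "[r for r in @ if r['age'] > $min_age]", min_age=25)
same = run_custom(rows, "@.sort(key=lambda r: r['age'])")
print(adults)
print(same)
```

The first call returns a new object, which is passed back; the second sorts in place and returns nothing, so the modified canonical itself is passed back.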

model_dict_column(canonical: ~typing.Any, header: str, convert_str: bool = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → DataFrame

takes a column that contains dict and expands them into columns. Note, the column must be a flat dictionary. Complex structures will not work.

Parameters:

canonical – a pd.DataFrame as the reference dataframe
header – the header of the column to be converted
convert_str – (optional) if the header has the dict as a string convert to dict using ast.literal_eval()
seed – (optional) this is a placeholder, here for compatibility across methods
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame
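
A minimal sketch of expanding a flat dict column, using a list of row dictionaries in place of a pd.DataFrame. The helper `expand_dict_column` is hypothetical, and dropping the original column after expansion is an assumption.

```python
import ast

def expand_dict_column(rows, header, convert_str=False):
    expanded = []
    for row in rows:
        value = row[header]
        if convert_str and isinstance(value, str):
            # safely parse a dict held as a string literal
            value = ast.literal_eval(value)
        # keep the other columns, then add one column per dict key
        new_row = {k: v for k, v in row.items() if k != header}
        new_row.update(value)
        expanded.append(new_row)
    return expanded

rows = [
    {"id": 1, "profile": "{'age': 31, 'city': 'Leeds'}"},
    {"id": 2, "profile": "{'age': 45, 'city': 'York'}"},
]
flat = expand_dict_column(rows, "profile", convert_str=True)
print(flat)
```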

model_difference(canonical: ~typing.Any, other: ~typing.Any, on_key: [<class 'str'>, <class 'list'>], drop_zero_sum: bool = None, summary_connector: bool = None, flagged_connector: str = None, detail_connector: str = None, unmatched_connector: str = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

returns the difference between two canonicals, joined on a common and unique key. The on_key parameter can be a direct reference to the canonical column header or to an environment variable. If the environment variable is used on_key should be set to "${<<YOUR_ENVIRON>>}" where <<YOUR_ENVIRON>> is the environment variable name.

If the flagged connector parameter is used, a report flagging mismatched left data with right data is produced for this connector, where 1 indicates a difference and 0 indicates they are the same. By default this method returns this report, but if this parameter is set the original canonical is returned. This allows a canonical pipeline to continue through the component while outputting the difference report.

If the detail connector parameter is used, a detail report of the difference where the left and right values that differ are shown.

If the unmatched connector parameter is used, the on_key values that don’t match between left and right are reported

Parameters:

canonical – a direct or generated pd.DataFrame. see context notes below
other – a direct or generated pd.DataFrame to compare against
on_key – The name of the key that uniquely joins the canonical to others
drop_zero_sum – (optional) drops rows and columns which have a total sum of zero differences
summary_connector – (optional) a connector name where the summary report is sent
flagged_connector – (optional) a connector name where the differences are flagged
detail_connector – (optional) a connector name where the differences are shown
unmatched_connector – (optional) a connector name where the unmatched keys are shown
seed – (optional) this is a placeholder, here for compatibility across methods
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame

The other is a pd.DataFrame, a pd.Series or list, a connector contract str reference or a set of parameter instructions on how to generate a pd.DataFrame. The description of each is:

pd.DataFrame -> a deep copy of the pd.DataFrame
pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given
str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection
dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame
methods:
model_*(…) -> one of the SyntheticBuilder model methods and parameters

@empty -> generates an empty pd.DataFrame where size and headers can be passed
:size sets the index size of the dataframe
:headers any initial headers for the dataframe

@generate -> generate a synthetic file from a remote Domain Contract
:task_name the name of the SyntheticBuilder task to run
:repo_uri the location of the Domain Product
:size (optional) a size to generate
:seed (optional) if a seed should be applied
:run_book (optional) if specific intent should be run only
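
The flagged and unmatched reports for a difference can be pictured with the sketch below, which compares two lists of row dictionaries joined on a unique key. The function `flag_differences` and the sample data are hypothetical; the library works on DataFrames and writes its reports to connectors.

```python
def flag_differences(left, right, on_key):
    """Return (flagged rows, unmatched keys): 1 marks a difference, 0 marks equal values."""
    right_by_key = {row[on_key]: row for row in right}
    flagged, unmatched = [], []
    for row in left:
        key = row[on_key]
        other = right_by_key.get(key)
        if other is None:
            unmatched.append(key)  # key exists on the left only
            continue
        flags = {on_key: key}
        for column, value in row.items():
            if column != on_key:
                flags[column] = int(other.get(column) != value)
        flagged.append(flags)
    left_keys = {row[on_key] for row in left}
    # keys that exist on the right only
    unmatched += [k for k in right_by_key if k not in left_keys]
    return flagged, unmatched

left = [{"id": 1, "qty": 5, "price": 2.0}, {"id": 2, "qty": 3, "price": 4.0}, {"id": 3, "qty": 1, "price": 9.0}]
right = [{"id": 1, "qty": 5, "price": 2.5}, {"id": 2, "qty": 3, "price": 4.0}, {"id": 4, "qty": 7, "price": 1.0}]
flagged, unmatched = flag_differences(left, right, on_key="id")
print(flagged)
print(unmatched)
```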

model_drop_outliers(canonical: ~typing.Any, header: str, measure: [<class 'int'>, <class 'float'>] = None, method: str = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

Drops rows in the canonical where the values are deemed outliers based on the method and measure. The selectable methods are interquartile or empirical, of which interquartile is the default.

The ‘empirical’ rule states that for a normal distribution, nearly all of the data will fall within three standard deviations of the mean. Given mu and sigma, a simple way to identify outliers is to compute a z-score for every value, defined as the number of standard deviations a value is away from the mean. The measure given should therefore be the z-score. The 68–95–99.7 rule gives the percentage of values that lie within a band around the mean in a normal distribution with a width of two, four and six standard deviations, respectively, and thus guides the choice of z-score.

The ‘interquartile’ range (IQR), also called the midspread, middle 50%, or H‑spread, is a measure of statistical dispersion, equal to the difference between the 75th and 25th percentiles, or between the upper and lower quartiles of a sample set. The IQR can be used to identify outliers by defining limits on the sample values that are a factor k of the IQR below the 25th percentile or above the 75th percentile. The common value for the factor k is 1.5. A factor k of 3 or more can be used to identify values that are extreme outliers.

Parameters:

canonical – a pd.DataFrame as the reference dataframe
header – the header in the DataFrame to correlate
method – (optional) A method to run to identify outliers. interquartile (default) or empirical
measure – (optional) A measure against each method, respectively factor k, z-score, quartile (see above)
seed – (optional) the random seed
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

an equal length list of correlated values
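
Both outlier rules can be sketched with the standard `statistics` module. This operates on a plain list rather than DataFrame rows, the function `drop_outliers` is a hypothetical stand-in, and the default measures (k = 1.5 and a z-score of 3) and the use of the population standard deviation are assumptions.

```python
import statistics

def drop_outliers(values, method="interquartile", measure=None):
    if method == "interquartile":
        k = 1.5 if measure is None else measure
        q1, _, q3 = statistics.quantiles(values, n=4)
        iqr = q3 - q1
        # keep values within k * IQR of the quartiles
        low, high = q1 - k * iqr, q3 + k * iqr
        return [v for v in values if low <= v <= high]
    if method == "empirical":
        z_limit = 3 if measure is None else measure
        mean = statistics.mean(values)
        sigma = statistics.pstdev(values)
        # keep values whose z-score is within the limit
        return [v for v in values if abs(v - mean) / sigma <= z_limit]
    raise ValueError(f"unknown method: {method}")

data = [10, 12, 11, 13, 12, 11, 100]
by_iqr = drop_outliers(data)
by_z = drop_outliers(data, method="empirical", measure=2)
print(by_iqr)
print(by_z)
```

The empirical call uses a measure of 2 because, with only seven values, no single point can reach a population z-score of 3, so the default would keep everything in this small example.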

model_encode_count(canonical: ~typing.Any, headers: [<class 'str'>, <class 'list'>], prefix=None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → DataFrame

Encodes categorical data types. In count encoding we replace the categories by the count of the observations that show that category in the dataset. This technique captures the representation of each label in a dataset, but the encoding may not necessarily be predictive of the outcome.

Parameters:

canonical – a pd.DataFrame as the reference dataframe
headers – the header(s) to apply the encoding
prefix – a str to prefix the column
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame
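
As an illustration of the count encoding technique described above, the following is a minimal sketch in plain pandas. It is not the library's implementation; the sample data and the 'colour' and 'colour_count' column names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; 'colour' stands in for any categorical header.
df = pd.DataFrame({"colour": ["red", "blue", "red", "green", "red", "blue"]})

# Count how often each category occurs in the column.
counts = df["colour"].value_counts()

# Replace each category by the number of observations that carry it.
df["colour_count"] = df["colour"].map(counts)

print(df)
```

Two categories that happen to occur equally often receive the same encoded value, which is one reason the encoding may not be predictive of the outcome.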

model_encode_integer(canonical: ~typing.Any, headers: [<class 'str'>, <class 'list'>], ranking: list = None, prefix=None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

Integer encoding replaces the categories by digits from 1 to n, where n is the number of distinct categories of the variable. Integer encoding can be either nominal or ordinal.

Nominal data is categorical variables without any particular order between categories. This means that the categories cannot be sorted and there is no natural order between them.

Ordinal data represents categories with a natural, ordered relationship between each category. This means that the categories can be sorted in either ascending or descending order. In order to encode integers as ordinal, a ranking must be provided.

If ranking is given, the return will be ordinal values based on the ranking order of the list. If a categorical value is not found in the list it is grouped with other missing values and given the last ranking.

Parameters:

canonical – a pd.DataFrame as the reference dataframe
headers – the header(s) to apply the encoding
ranking – (optional) if used, ranks the categorical values to the list given
prefix – a str to prefix the column
seed – (optional) a seed value for the random function. Defaults to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None: defaults to -1. If -1: added to a level above any current instance of the intent section, or level 0 if not found. If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame
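
The difference between nominal and ordinal integer encoding can be sketched in plain pandas as follows. This is an illustration of the technique under stated assumptions, not the library's code; the data, the ranking list and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"size": ["small", "large", "medium", "small", "huge"]})

# Nominal: categories get digits 1..n in order of first appearance,
# with no meaning attached to the order.
codes, _ = pd.factorize(df["size"])
df["size_nominal"] = codes + 1

# Ordinal: the position in a ranking list gives the value, starting at 1.
ranking = ["small", "medium", "large"]
rank_map = {label: idx for idx, label in enumerate(ranking, start=1)}

# Values not found in the ranking are grouped together and given the last rank.
last_rank = len(ranking) + 1
df["size_ordinal"] = df["size"].map(rank_map).fillna(last_rank).astype(int)

print(df)
```

Here 'huge' is absent from the ranking, so it falls into the final rank together with any other unranked values.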

model_encode_one_hot(canonical: ~typing.Any, headers: [<class 'str'>, <class 'list'>], prefix=None, dtype: ~typing.Any = None, prefix_sep: str = None, dummy_na: bool = False, drop_first: bool = False, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → DataFrame

Encodes categorical data types. One hot encoding consists of encoding each categorical variable with different boolean variables (also called dummy variables) which take the values 0 or 1, indicating if a category is present in an observation.

Parameters:

canonical – a pd.DataFrame as the reference dataframe
headers – the header(s) to apply multi-hot
prefix – str, list of str, or dict of str. String to append to the DataFrame column names; a list should have a length equal to the number of columns being encoded.
prefix_sep – str separator, default ‘_’
dummy_na – Add a column to indicate null values; if False, nulls are ignored.
drop_first – Whether to get k-1 dummies out of k categorical levels by removing the first level.
dtype – Data type for new columns. Only a single dtype is allowed.
seed – (optional) a seed value for the random function. Defaults to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None: defaults to -1. If -1: added to a level above any current instance of the intent section, or level 0 if not found. If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame
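
The parameters above closely mirror those of pandas get_dummies, so the effect of prefix, prefix_sep, dummy_na and drop_first can be explored with a small pandas sketch. Whether the method delegates to get_dummies internally is an assumption; the data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data with one null in the categorical column.
df = pd.DataFrame({"id": [1, 2, 3, 4], "colour": ["red", "blue", None, "red"]})

encoded = pd.get_dummies(
    df,
    columns=["colour"],  # the header(s) to encode
    prefix="col",        # prepended to each new column name
    prefix_sep="_",
    dummy_na=True,       # add an indicator column for nulls
    drop_first=False,    # keep all k dummies rather than k-1
    dtype=int,           # 0/1 integers rather than booleans
)

print(encoded)
```

Setting drop_first=True would remove the first level, which avoids perfectly collinear dummy columns in linear models.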

model_explode(canonical: ~typing.Any, header: str, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → DataFrame

Takes a single column of list values and explodes the DataFrame so that each element of a row's list is represented by its own row.

Parameters:

canonical – a direct or generated pd.DataFrame. see context notes below
header – the header of the column to be exploded
seed – (optional) this is a placeholder, here for compatibility across methods
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None: defaults to -1. If -1: added to a level above any current instance of the intent section, or level 0 if not found. If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame
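
The behaviour described above matches what pandas DataFrame.explode does, so a short pandas sketch may help picture the result. It is an illustration only, with hypothetical data and column names.

```python
import pandas as pd

# Hypothetical data: each row of 'items' holds a list.
df = pd.DataFrame({"order": ["A", "B"], "items": [["pen", "ink"], ["pad"]]})

# Each list element becomes its own row; the other columns are repeated.
exploded = df.explode("items").reset_index(drop=True)

print(exploded)
```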

model_group(canonical: ~typing.Any, group_by: [<class 'str'>, <class 'list'>], headers: [<class 'str'>, <class 'list'>] = None, regex: bool = None, aggregator: str = None, list_choice: int = None, list_max: int = None, drop_group_by: bool = False, seed: int = None, include_weighting: bool = False, freq_precision: int = None, remove_weighting_zeros: bool = False, remove_aggregated: bool = False, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → DataFrame

Groups the canonical by the ‘group_by’ headers and aggregates the selected ‘headers’. In addition to the standard groupby aggregators there are also ‘list’ and ‘set’, which return an aggregated list or set. These can be used in conjunction with ‘list_choice’ and ‘list_max’ to control the return values. If list_max is set to 1 then a single value is returned rather than a list of size 1.

Parameters:

canonical – a direct or generated pd.DataFrame. see context notes below
headers – the column headers to apply the aggregation to
group_by – the column headers to group by
regex – if the column headers are a regex
aggregator – (optional) the aggregator as a function of Pandas DataFrame ‘groupby’ or ‘list’ or ‘set’
list_choice – (optional) used in conjunction with list or set aggregator to return a random n choice
list_max – (optional) used in conjunction with list or set aggregator restricts the list to a n size
drop_group_by – (optional) drops the group by headers
include_weighting – (optional) include a percentage weighting column for each
freq_precision – (optional) a precision for the relative_freq values
remove_aggregated – (optional) if used in conjunction with the weighting then drops the aggregator column
remove_weighting_zeros – (optional) removes zero values
seed – (optional) this is a placeholder, here for compatibility across methods
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None: defaults to -1. If -1: added to a level above any current instance of the intent section, or level 0 if not found. If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame
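
To picture what the ‘list’ and ‘set’ aggregators return, the following plain pandas sketch aggregates a column into a list and into a set per group. It illustrates the idea only and does not reproduce the library's handling of list_choice or list_max; the data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "state": ["NY", "NY", "TX", "TX", "TX"],
    "city": ["Albany", "Buffalo", "Austin", "Dallas", "Austin"],
})

# 'list' style aggregation: keeps every value, duplicates included.
as_list = df.groupby("state")["city"].agg(list).reset_index()

# 'set' style aggregation: keeps only the distinct values.
as_set = df.groupby("state")["city"].agg(lambda s: set(s)).reset_index()

print(as_list)
print(as_set)
```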

model_merge(canonical: ~typing.Any, other: ~typing.Any, left_on: str = None, right_on: str = None, on: str = None, how: str = None, headers: list = None, suffixes: tuple = None, indicator: bool = None, validate: str = None, replace_nulls: bool = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → DataFrame

returns the full column values directly from another connector data source.

Parameters:

canonical – a direct or generated pd.DataFrame. see context notes below
other – a direct or generated pd.DataFrame. see context notes below
left_on – the canonical key column(s) to join on
right_on – the merging dataset key column(s) to join on
on – if the left and right join keys have the same header name, this can replace left_on and right_on
how – (optional) One of ‘left’, ‘right’, ‘outer’, ‘inner’. Defaults to inner. See below for more detailed description of each method.
headers – (optional) a filter on the headers included from the right side
suffixes – (optional) A tuple of string suffixes to apply to overlapping columns. Defaults to (‘’, ‘_dup’).
indicator – (optional) Add a column to the output DataFrame called _merge with information on the source of each row. _merge is Categorical-type and takes on a value of left_only for observations whose merge key only appears in ‘left’ DataFrame or Series, right_only for observations whose merge key only appears in ‘right’ DataFrame or Series, and both if the observation’s merge key is found in both.
validate – (optional) validate : string, default None. If specified, checks if merge is of specified type. “one_to_one” or “1:1”: checks if merge keys are unique in both left and right datasets. “one_to_many” or “1:m”: checks if merge keys are unique in left dataset. “many_to_one” or “m:1”: checks if merge keys are unique in right dataset. “many_to_many” or “m:m”: allowed, but does not result in checks.
replace_nulls – (optional) replaces nulls with an appropriate value dependent upon the field type
seed – this is a placeholder, here for compatibility across methods
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None: defaults to -1. If -1: added to a level above any current instance of the intent section, or level 0 if not found. If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame

The other is a pd.DataFrame, a pd.Series or list, a connector contract str reference, or a set of parameter instructions on how to generate a pd.DataFrame. The description of each is:

pd.DataFrame -> a deep copy of the pd.DataFrame
pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given
str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection
dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame
methods:
model_*(…) -> one of the SyntheticBuilder model methods and parameters

@empty -> generates an empty pd.DataFrame where size and headers can be passed
:size sets the index size of the dataframe
:headers any initial headers for the dataframe

@generate -> generate a synthetic file from a remote Domain Contract
:task_name the name of the SyntheticBuilder task to run
:repo_uri the location of the Domain Product
:size (optional) a size to generate
:seed (optional) if a seed should be applied
:run_book (optional) if specific intent should be run only
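
The how, suffixes, indicator and validate parameters of model_merge read like their counterparts in pandas merge, so their combined effect can be sketched with plain pandas. That the method passes them through unchanged is an assumption; the two frames and their column names are hypothetical.

```python
import pandas as pd

# Hypothetical canonical and other datasets sharing the key 'cust_id'.
canonical = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
other = pd.DataFrame({
    "cust_id": [2, 3, 4],
    "name": ["B.", "C.", "D."],
    "spend": [20.0, 30.0, 40.0],
})

merged = pd.merge(
    canonical,
    other,
    how="left",              # keep every canonical row
    on="cust_id",            # same key name on both sides
    suffixes=("", "_dup"),   # overlapping 'name' from other becomes 'name_dup'
    indicator=True,          # adds the '_merge' source column
    validate="one_to_one",   # raises if keys are not unique on both sides
)

print(merged)
```

With a left join the row for cust_id 1 has no match, so its '_merge' value is left_only and its 'spend' is null.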

model_missing_cca(canonical: ~typing.Any, threshold: float = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → DataFrame

Applies Complete Case Analysis to the canonical. Complete-case analysis (CCA), also called “list-wise deletion” of cases, consists of discarding observations with any missing values. In other words, we only keep observations with data on all the variables. CCA works well when the data is missing completely at random.

Parameters:

canonical – a pd.DataFrame as the reference dataframe
threshold – (optional) a null threshold between 0 and 1 where 1 is all nulls. Default to 0.5
seed – (optional) a placeholder
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None: defaults to -1. If -1: added to a level above any current instance of the intent section, or level 0 if not found. If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame
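
Complete case analysis itself can be sketched in a few lines of pandas. How the threshold is applied is not spelled out above, so the sketch assumes one plausible reading: columns whose share of nulls exceeds the threshold are dropped first, then any row with a remaining null is discarded. The data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data with nulls.
df = pd.DataFrame({
    "a": [1.0, None, 3.0, 4.0],
    "b": [None, None, None, 8.0],
    "c": ["w", "x", "y", "z"],
})

threshold = 0.5

# Assumption: drop columns where the ratio of nulls is above the threshold.
null_ratio = df.isna().mean()
kept = df.loc[:, null_ratio <= threshold]

# Complete case analysis: keep only the rows with data in every column.
complete = kept.dropna(axis=0, how="any")

print(complete)
```

Column 'b' is 75% null and is removed, after which only the row with a null in 'a' has to be discarded.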

model_modifier(canonical: ~typing.Any, other: ~typing.Any, targets_header: str = None, values_header: str = None, modifier: str = None, seed: int = None, precision: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → DataFrame

Modifies a given set of target header names within the canonical with the target value for that name. The modifier indicates the type of modification to be performed. It is assumed the other DataFrame has the target headers as the first column and the target values as the second column; if this is not the case, the targets_header and values_header parameters can be used to specify the other header names.

Parameters:

canonical – a pd.DataFrame as the reference dataframe
other – a direct or generated pd.DataFrame. see context notes below
targets_header – (optional) the name of the target header where the header names are listed
values_header – (optional) The name of the value header where the target values are listed
modifier – (optional) how the value is to be modified. Options are ‘add’, ‘sub’, ‘mul’, ‘div’
precision – (optional) the value precision of the return values
seed – (optional) this is a placeholder, here for compatibility across methods
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None: defaults to -1. If -1: added to a level above any current instance of the intent section, or level 0 if not found. If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame

The other is a pd.DataFrame, a pd.Series or list, a connector contract str reference, or a set of parameter instructions on how to generate a pd.DataFrame. The description of each is:

pd.DataFrame -> a deep copy of the pd.DataFrame
pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given
str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection
dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame
methods:
model_*(…) -> one of the SyntheticBuilder model methods and parameters

@empty -> generates an empty pd.DataFrame where size and headers can be passed
:size sets the index size of the dataframe
:headers any initial headers for the dataframe

@generate -> generate a synthetic file from a remote Domain Contract
:task_name the name of the SyntheticBuilder task to run
:repo_uri the location of the Domain Product
:size (optional) a size to generate
:seed (optional) if a seed should be applied
:run_book (optional) if specific intent should be run only
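
The layout model_modifier expects of the other DataFrame (target headers in the first column, target values in the second) can be pictured with a plain pandas sketch. It is an illustration of the idea under that assumption, not the library's code; the frames, the 'target' and 'value' headers and the chosen precision are hypothetical.

```python
import operator

import pandas as pd

# Hypothetical canonical to be modified.
canonical = pd.DataFrame({"price": [10.0, 20.0], "qty": [1.0, 2.0]})

# Hypothetical other: first column names the target headers,
# second column holds the value to apply to each.
other = pd.DataFrame({"target": ["price", "qty"], "value": [1.5, 3.0]})

# The four modifier options mapped to their arithmetic operators.
ops = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "div": operator.truediv,
}
modifier = "mul"
precision = 2

result = canonical.copy()
for header, value in zip(other.iloc[:, 0], other.iloc[:, 1]):
    # Apply the modifier to the whole target column, then round.
    result[header] = ops[modifier](result[header], value).round(precision)

print(result)
```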

model_noise(canonical: ~typing.Any, num_columns: int, inc_targets: bool = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → DataFrame

Generates multiple columns of noise in your dataset

Parameters:

canonical – a direct or generated pd.DataFrame. see context notes below
num_columns – the number of columns of noise
inc_targets – (optional) if a predictor target should be included. default is false
seed – (optional) a seed value for the random function. Defaults to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None: defaults to -1. If -1: added to a level above any current instance of the intent section, or level 0 if not found. If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a DataFrame

The canonical is a pd.DataFrame, a pd.Series or list, a connector contract str reference, or a set of parameter instructions on how to generate a pd.DataFrame. The description of each is:

pd.DataFrame -> a deep copy of the pd.DataFrame
pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given
str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection
dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame
methods:
model_*(…) -> one of the SyntheticBuilder model methods and parameters

@empty -> generates an empty pd.DataFrame where size and headers can be passed
:size sets the index size of the dataframe
:headers any initial headers for the dataframe

@generate -> generate a synthetic file from a remote Domain Contract
:task_name the name of the SyntheticBuilder task to run
:repo_uri the location of the Domain Product
:size (optional) a size to generate
:seed (optional) if a seed should be applied
:run_book (optional) if specific intent should be run only

model_profiling(canonical: ~typing.Any, profiling: str, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, connector_name: str = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs)

Data profiling is the process of examining, analyzing, and creating useful summaries of data. The process yields a high-level overview which aids in the discovery of data quality issues, risks, and overall trends. It can be used to identify any errors, anomalies, or patterns that may exist within the data. There are three types of data profiling available: ‘dictionary’, ‘schema’ or ‘quality’.

If the connector_name is used, it outputs the results to this connector contract and returns the original canonical. This allows a canonical pipeline to continue through the component while outputting the data profile to an alternative path.

Parameters:

canonical – a direct or generated pd.DataFrame. see context notes below
profiling – The profiling name. Options are ‘dictionary’, ‘schema’ or ‘quality’
headers – (optional) a filter of headers from the ‘other’ dataset
drop – (optional) to drop or not drop the headers if specified
dtype – (optional) a filter on data type for the ‘other’ dataset. int, float, bool, object
exclude – (optional) to exclude or include the data types if specified
regex – (optional) a regular expression to search the headers. For example, ‘^((?!_amt).)*$’ excludes headers containing ‘_amt’
re_ignore_case – (optional) true if the regex should ignore case. Default is False
connector_name – (optional) a connector name where the outcome is sent
seed – (optional) this is a placeholder, here for compatibility across methods
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None: defaults to -1. If -1: added to a level above any current instance of the intent section, or level 0 if not found. If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
kwargs – if using connector_name, any kwargs to pass to the handler

Returns:

a pd.DataFrame

The canonical is a pd.DataFrame, a pd.Series or list, a connector contract str reference, or a set of parameter instructions on how to generate a pd.DataFrame. The description of each is:

pd.DataFrame -> a deep copy of the pd.DataFrame
pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given
str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection
dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame
methods:
model_*(…) -> one of the SyntheticBuilder model methods and parameters

@empty -> generates an empty pd.DataFrame where size and headers can be passed
:size sets the index size of the dataframe
:headers any initial headers for the dataframe

@generate -> generate a synthetic file from a remote Domain Contract
:task_name the name of the SyntheticBuilder task to run
:repo_uri the location of the Domain Product
:size (optional) a size to generate
:seed (optional) if a seed should be applied
:run_book (optional) if specific intent should be run only
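
A negative-lookahead pattern of the kind shown for the regex parameter of model_profiling can be tried out with the standard re module before it is used as a header filter. The sketch below is independent of the library; the header names are hypothetical, and mapping re_ignore_case to re.IGNORECASE is an assumption.

```python
import re

# Hypothetical column headers.
headers = ["cust_id", "order_amt", "tax_amt", "region"]

# Matches only strings that do not contain '_amt' anywhere.
pattern = re.compile(r"^((?!_amt).)*$")
selected = [h for h in headers if pattern.search(h)]

# Case-insensitive variant, as re_ignore_case=True might behave (assumption).
pattern_ci = re.compile(r"^((?!_amt).)*$", flags=re.IGNORECASE)
selected_ci = [h for h in ["Order_AMT", "region"] if pattern_ci.search(h)]

print(selected)
print(selected_ci)
```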

model_sample(canonical: ~typing.Any, other: ~typing.Any, headers: list, replace: bool = None, relative_freq: list = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → DataFrame

Takes a target dataset and samples from that target to the size of the canonical

Parameters:

canonical – a pd.DataFrame as the reference dataframe
other – a direct or generated pd.DataFrame. see context notes below
headers – the headers to be selected from the other DataFrame
replace – assuming other is bigger than canonical, selects without replacement when True
relative_freq – (optional) a weighting pattern that does not have to add to 1
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None: defaults to -1. If -1: added to a level above any current instance of the intent section, or level 0 if not found. If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a DataFrame
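
The idea of sampling another dataset up to the size of the canonical can be sketched with pandas DataFrame.sample. This is an illustration of the technique, not the library's implementation; the frames and column names are hypothetical, and the weights play the role that relative_freq is described as having.

```python
import pandas as pd

# Hypothetical canonical (5 rows) and a smaller 'other' to sample from.
canonical = pd.DataFrame({"id": range(5)})
other = pd.DataFrame({
    "city": ["Austin", "Boston", "Chicago"],
    "zip": ["73301", "02101", "60601"],
})

# Weights do not need to sum to 1; pandas normalises them.
relative_freq = [5, 3, 2]

sampled = (
    other[["city", "zip"]]
    .sample(
        n=len(canonical),
        replace=True,  # pandas meaning: rows may repeat, needed as other is smaller
        weights=relative_freq,
        random_state=42,
    )
    .reset_index(drop=True)
)

# Attach the sampled columns row by row to the canonical.
result = pd.concat([canonical, sampled], axis=1)

print(result)
```

Rows are sampled whole, so each sampled city keeps its matching zip.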

model_sample_map(canonical: ~typing.Any, sample_map: str, selection: list = None, headers: [<class 'str'>, <class 'list'>] = None, shuffle: bool = None, rename_columns: dict = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs) → DataFrame

builds a model of a Sample Mapped distribution. To see the sample maps available use the MappedSample class __dir__() method:

> from ds_discovery.sample.sample_data import MappedSample
> MappedSample().__dir__()

Parameters:

canonical – a direct or generated pd.DataFrame. see context notes below
sample_map – the sample map name. use MappedSample().__dir__() to get a list of available samples
rename_columns – (optional) rename the columns ‘City’, ‘Zipcode’, ‘State’
selection – (optional) a list of selections where conditions are filtered on, executed in list order. An example of a selection with the minimum requirements is (see ‘select2dict(…)’): [{‘column’: ‘state’, ‘condition’: “isin([‘NY’, ‘TX’])”}]
headers – a header or list of headers to filter on
shuffle – (optional) if the selection should be shuffled before selection. Default is true
seed – (optional) a seed value for the random function. Defaults to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None: defaults to -1. If -1: added to a level above any current instance of the intent section, or level 0 if not found. If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
kwargs – any additional parameters to pass to the sample map

Returns:

a DataFrame

The canonical is a pd.DataFrame, a pd.Series or list, a connector contract str reference, or a set of parameter instructions on how to generate a pd.DataFrame. The description of each is:

pd.DataFrame -> a deep copy of the pd.DataFrame
pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given
str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection
dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame
methods:
model_*(…) -> one of the SyntheticBuilder model methods and parameters

@empty -> generates an empty pd.DataFrame where size and headers can be passed
:size sets the index size of the dataframe
:headers any initial headers for the dataframe

@generate -> generate a synthetic file from a remote Domain Contract
:task_name the name of the SyntheticBuilder task to run
:repo_uri the location of the Domain Product
:size (optional) a size to generate
:seed (optional) if a seed should be applied
:run_book (optional) if specific intent should be run only

model_synthetic_data_types(canonical: int = None, extended: bool = False, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → DataFrame

A dataset with example data types

Parameters:

canonical – the canonical size (rows) of the sample dataset
extended – if the types should extend beyond the standard 6 types including nulls, predominance, etc.
seed – a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None: defaults to -1. If -1: added to a level above any current instance of the intent section, or level 0 if not found. If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame

model_synthetic_personal_identity(canonical: int = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → DataFrame

A dataset with Personally Identifiable Information

Parameters:

canonical – the canonical size (rows) of the sample dataset
seed – a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None: defaults to -1. If -1: added to a level above any current instance of the intent section, or level 0 if not found. If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame

model_to_category(canonical: ~typing.Any, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

converts columns to categories

Parameters:

canonical – a pd.DataFrame as the reference dataframe
headers – a list of headers to drop or filter on type
drop – to drop or not drop the headers
dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’
exclude – to exclude or include the dtypes
regex – a regular expression to search the headers
re_ignore_case – true if the regex should ignore case. Default is False
seed – (optional) a placeholder
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run.
    - If None: defaults to -1
    - If -1: added to a level above any current instance of the intent section, level 0 if not found
    - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method already exists at the level, or default level.
    - True: replaces the current intent method with the new
    - False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame
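
The column selection and conversion described above can be approximated in plain pandas. The sketch below is not the library's implementation: `to_category_sketch` is a hypothetical helper, and the way it combines `headers`, `regex` and `dtype` (each filter narrowing the previous selection) is an assumption made for illustration.

```python
import re
import pandas as pd


def to_category_sketch(df, headers=None, drop=False, dtype=None,
                       exclude=False, regex=None, re_ignore_case=False):
    """Hypothetical stand-in: select columns, then cast them to 'category'."""
    cols = list(df.columns)
    if headers is not None:
        wanted = [headers] if isinstance(headers, str) else list(headers)
        # drop=True keeps every column except the named headers
        cols = [c for c in cols if (c in wanted) != drop]
    if regex is not None:
        patterns = [regex] if isinstance(regex, str) else list(regex)
        flags = re.IGNORECASE if re_ignore_case else 0
        cols = [c for c in cols
                if any(re.search(p, str(c), flags) for p in patterns)]
    if dtype is not None:
        dtypes = [dtype] if isinstance(dtype, str) else list(dtype)
        key = 'exclude' if exclude else 'include'
        typed = set(df[cols].select_dtypes(**{key: dtypes}).columns)
        cols = [c for c in cols if c in typed]
    result = df.copy()  # leave the caller's frame untouched
    for c in cols:
        result[c] = result[c].astype('category')
    return result


df = pd.DataFrame({'city': ['Leeds', 'York', 'Leeds'], 'age': [31, 45, 27]})
by_name = to_category_sketch(df, headers=['city'])
by_regex = to_category_sketch(df, regex='^CI', re_ignore_case=True)
print(by_name.dtypes)
```

Selecting by name and selecting by a case-insensitive regex both convert only the `city` column here; the original frame is left unchanged because the helper works on a copy.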

model_to_numeric(canonical: ~typing.Any, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, precision: int = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

converts columns to numeric value

Parameters:

canonical – a pd.DataFrame as the reference dataframe
headers – a list of headers to drop or filter on type
drop – to drop or not drop the headers
dtype – the column types to include or exclude. Defaults to None; otherwise one or more of int, float, bool, object, ‘number’
exclude – to exclude or include the dtypes
regex – a regular expression to search the headers
re_ignore_case – true if the regex should ignore case. Default is False
precision – (optional) an int value of the precision for the float
seed – (optional) a placeholder
save_intent – (optional) if the intent contract should be saved to the property manager
column_name – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run.
    - If None: defaults to -1
    - If -1: added to a level above any current instance of the intent section, level 0 if not found
    - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method already exists at the level, or default level.
    - True: replaces the current intent method with the new
    - False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame
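
As a rough illustration of numeric conversion with a `precision`, the following plain-pandas sketch converts the named columns and rounds the result. `to_numeric_sketch` is a hypothetical helper, and coercing unparseable values to NaN is an assumption; the reference above does not say how the method treats them.

```python
import pandas as pd


def to_numeric_sketch(df, headers, precision=None):
    """Hypothetical stand-in: convert the named columns to numbers."""
    result = df.copy()
    for c in headers:
        # assumption: values that cannot be parsed become NaN
        result[c] = pd.to_numeric(result[c], errors='coerce')
        if precision is not None:
            result[c] = result[c].round(precision)
    return result


df = pd.DataFrame({'price': ['1.234', '5.678', 'n/a'], 'item': ['a', 'b', 'c']})
converted = to_numeric_sketch(df, headers=['price'], precision=1)
print(converted)
```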

run_intent_pipeline(canonical: ~typing.Any = None, intent_levels: [<class 'str'>, <class 'int'>, <class 'list'>] = None, run_book: str = None, seed: int = None, simulate: bool = None, **kwargs) → DataFrame

Collectively runs all parameterised intent taken from the property manager against the code base, as defined by the intent_contract. The whole run can be seeded, though any parameterised seeding in the intent contracts will take precedence.

Parameters:

canonical – a direct or generated pd.DataFrame. see context notes below
intent_levels – (optional) a single intent_level, or a list of them, to run in the order given
run_book – (optional) a preset runbook of intent_level to run in order
seed – (optional) a seed value that will be applied across the run. Defaults to None
simulate – (optional) returns a report of the run order together with the indexed column order of the run

Returns:

a pandas dataframe
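
The seeding rule described above (a seed stored with an intent takes precedence over the run-wide seed) and the ordered execution of levels can be sketched as follows. Everything here is illustrative: the `contract` dictionary layout, `run_levels_sketch`, `add_random_column` and the shape of the `simulate` report are assumptions, not the property manager's actual format.

```python
import numpy as np
import pandas as pd


def add_random_column(df, name, seed=None):
    """Toy intent: append a column of seeded random integers."""
    rng = np.random.default_rng(seed)
    return df.assign(**{name: rng.integers(0, 100, size=len(df))})


def run_levels_sketch(df, contract, intent_levels=None, seed=None, simulate=False):
    """contract: hypothetical mapping of level -> list of (function, params)."""
    if intent_levels is None:
        levels = list(contract)
    elif isinstance(intent_levels, list):
        levels = intent_levels
    else:
        levels = [intent_levels]
    order = []
    for level in levels:
        for func, params in contract[level]:
            effective = dict(params)
            # a seed saved with the intent wins over the run-wide seed
            if effective.get('seed') is None:
                effective['seed'] = seed
            order.append((level, func.__name__, effective['seed']))
            if not simulate:
                df = func(df, **effective)
    return order if simulate else df


contract = {
    'A': [(add_random_column, {'name': 'x'})],
    'B': [(add_random_column, {'name': 'y', 'seed': 7})],
}
base = pd.DataFrame(index=range(5))
report = run_levels_sketch(base, contract, ['A', 'B'], seed=42, simulate=True)
result = run_levels_sketch(base, contract, ['A', 'B'], seed=42)
print(report)
```

In the simulated run, level `A` picks up the run-wide seed while level `B` keeps its own, which is the precedence the description implies.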

property sample_lists: list

A list of sample options

property sample_maps: list

A list of sample options

static select2dict(column: str, condition: str, expect: str | None = None, logic: str | None = None, date_format: str | None = None, offset: int | None = None) → dict

a utility method to help build feature conditions by aligning method parameters with dictionary format.

Parameters:

column – the column name to apply the condition to
condition – the condition string (special conditions are ‘date.now’ for current date)
expect – (optional) the data type to expect. If None then the data type is assumed from the dtype
logic – (optional) the logic to provide, see below for options
date_format – (optional) a format of the date if only a specific part of the date and time is required
offset – (optional) a time delta in days (+/-) from the current date and time (minutes not supported)

Returns:

dictionary of the parameters

logic:
    AND: the intersect of the current state with the condition result (common to both)
    NAND: outside the intersect of the current state with the condition result (not common to both)
    OR: the union of the current state with the condition result (everything in both)
    NOR: outside the union of the current state with the condition result (everything not in both)
    NOT: the difference between the current state and the condition result
    XOR: the difference between the union and the intersect of the current state with the condition result

extra logic:
    ALL: the intersect of the whole index with the condition result, irrespective of level or current state index
    ANY: the intersect of the level index with the condition result, irrespective of current state index
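
The logic options can be read as set operations on row indices. The sketch below is one interpretation using Python sets; `apply_logic` is a hypothetical helper, and treating NAND and NOR as complements relative to the whole index is an assumption drawn from the wording "outside the intersect" and "outside the union".

```python
def apply_logic(state, condition, logic, whole, level=None):
    """Combine the current state index with a condition result."""
    state, condition, whole = set(state), set(condition), set(whole)
    level = whole if level is None else set(level)
    rules = {
        'AND': lambda: state & condition,
        'NAND': lambda: whole - (state & condition),  # assumed complement
        'OR': lambda: state | condition,
        'NOR': lambda: whole - (state | condition),   # assumed complement
        'NOT': lambda: state - condition,
        'XOR': lambda: (state | condition) - (state & condition),
        'ALL': lambda: whole & condition,  # ignores level and current state
        'ANY': lambda: level & condition,  # ignores current state
    }
    if logic not in rules:
        raise ValueError(f"unknown logic option: {logic}")
    return rules[logic]()


whole = range(6)          # every row index in the frame
state = {0, 1, 2, 3}      # rows selected so far
condition = {2, 3, 4}     # rows matching the new condition
for name in ('AND', 'NAND', 'OR', 'NOR', 'NOT', 'XOR', 'ALL'):
    print(name, sorted(apply_logic(state, condition, name, whole)))
```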