IntentSyntheticBuild Class

class ds_discovery.intent.synthetic_intent.SyntheticIntentModel(property_manager: ~ds_discovery.managers.synthetic_property_manager.SyntheticPropertyManager, default_save_intent: bool = None, default_intent_level: [<class 'str'>, <class 'int'>, <class 'float'>] = None, order_next_available: bool = None, default_replace_intent: bool = None)

Synthetic data is representative data that, depending on its application, holds the statistical and distributive characteristics of its real-world counterpart. This component provides a set of actions that focus on building synthetic data through knowledge and statistical analysis.

static action2dict(method: Any, **kwargs) dict

a utility method to help build feature conditions by aligning method parameters with dictionary format.

Parameters:
  • method – the method to execute

  • kwargs – name value pairs associated with the method

Returns:

dictionary of the parameters

Special method values

  • @header: use a column as the value reference, expects the ‘header’ key

  • @constant: use a value constant, expects the key ‘value’

  • @sample: use to get sample values, expected ‘name’ of the Sample method, optional ‘shuffle’ boolean

  • @eval: evaluate a code string, expects the key ‘code_str’ and any locals() required
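
As a rough illustration of the dictionary format, the sketch below assumes the helper simply merges the method name with its keyword arguments. The function `action2dict_sketch` is a hypothetical stand-in written for this example and is not the library's implementation.

```python
from typing import Any


def action2dict_sketch(method: Any, **kwargs) -> dict:
    """Hypothetical stand-in: place the method name and its parameters in one dict."""
    # accept either a callable or a name such as '@header'
    name = method.__name__ if callable(method) else str(method)
    return {'method': name, **kwargs}


header_action = action2dict_sketch(method='@header', header='value')
constant_action = action2dict_sketch(method='@constant', value=7)
```

Under this assumption, each special method value becomes the ‘method’ entry and the keys it expects sit alongside it in the same dictionary.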

static canonical2dict(method: Any, **kwargs) dict

a utility method to help build feature conditions by aligning method parameters with dictionary format. The method parameter can be either a ‘model_*’ or ‘frame_*’ method, with two special reserved options

Special reserved method values

  • @empty: returns an empty dataframe, optionally with the key values size: int and headers: list

  • @generate: generates a dataframe either from_env(task_name) or from a remote repo uri. The params are:

      - task_name: the task name of the generator
      - repo_uri: (optional) a remote repo to retrieve the domain contract
      - size: (optional) the generated sample size
      - seed: (optional) if seeding should be applied, the seed value
      - run_book: (optional) a domain contract runbook to execute as part of the pipeline

Parameters:
  • method – the method to execute

  • kwargs – name value pairs associated with the method

Returns:

dictionary of the parameters

correlate_activation(canonical: ~typing.Any, header: str, activation: str = None, precision: int = None, seed: int = None, rtn_type: str = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

Activation functions play a crucial role in the backpropagation algorithm, which is the primary algorithm used for training neural networks. During backpropagation, the error of the output is propagated backwards through the network, and the weights of the network are updated based on this error. The activation function is used to introduce non-linearity into the output of a neural network layer.

Logistic Sigmoid, a.k.a. the logistic function (the inverse of the logit), maps any input value to a value between 0 and 1, making it useful for binary classification problems, and is defined as f(x) = 1/(1+exp(-x))

Tangent Hyperbolic (tanh) function is a shifted and stretched version of the Sigmoid function that maps the input values to a range between -1 and 1, and is defined as f(x) = (exp(x)-exp(-x))/(exp(x)+exp(-x))

Rectified Linear Unit (ReLU) function is a widely used activation function that replaces negative values with zero and keeps the positive values unchanged, and is defined as f(x) = x * (x > 0)
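
The three definitions above can be written directly with the standard library. This is a minimal sketch of the formulas only; it does not show how the method applies them to a canonical column, and the rounding to three places is an assumption that mirrors the documented default precision.

```python
import math


def sigmoid(x: float) -> float:
    # f(x) = 1 / (1 + exp(-x)), output in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x: float) -> float:
    # f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x)), output in (-1, 1)
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def relu(x: float) -> float:
    # equivalent to f(x) = x * (x > 0): negatives become zero, positives pass through
    return x if x > 0 else 0.0


values = [-2.0, 0.0, 2.0]
activated = {
    'sigmoid': [round(sigmoid(v), 3) for v in values],
    'tanh': [round(tanh(v), 3) for v in values],
    'relu': [round(relu(v), 3) for v in values],
}
```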

Parameters:
  • canonical – a pd.DataFrame as the reference dataframe

  • header – the header in the DataFrame to correlate

  • activation – (optional) the name of the activation function. Options ‘sigmoid’, ‘tanh’ and ‘relu’

  • precision – (optional) how many decimal places. default to 3

  • seed – (optional) the random seed. defaults to current datetime

  • rtn_type – (optional) changes the default return of a ‘list’ to a pd.Series other than the int, float, category, string and object, passing ‘as-is’ will return as is

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if an int, added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level: True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

an equal length list of correlated values

correlate_aggregate(canonical: ~typing.Any, headers: list, agg: str, seed: int = None, rtn_type: str = None, save_intent: bool = None, precision: int = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

correlate two or more columns with each other through a finite set of aggregation functions. The aggregation function names are limited to ‘sum’, ‘prod’, ‘count’, ‘min’, ‘max’ and ‘mean’ for numeric columns and a special ‘list’ function name to combine the columns as a list
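
To make the row-wise behaviour concrete, here is a small sketch of the named aggregations applied across columns. The `AGGREGATORS` lookup and `aggregate_rows` helper are hypothetical names for this example, a plain dict of lists stands in for the DataFrame, and ‘count’ here counts every value rather than only non-null ones.

```python
import math
from statistics import mean

# hypothetical lookup of the aggregation names listed above
AGGREGATORS = {
    'sum': sum,
    'prod': math.prod,
    'count': len,
    'min': min,
    'max': max,
    'mean': mean,
    'list': list,
}


def aggregate_rows(columns: dict, headers: list, agg: str) -> list:
    """Apply the named aggregation across the chosen headers, row by row."""
    rows = zip(*(columns[h] for h in headers))
    return [AGGREGATORS[agg](list(row)) for row in rows]


data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
totals = aggregate_rows(data, ['A', 'B'], 'sum')
combined = aggregate_rows(data, ['A', 'B'], 'list')
```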

Parameters:
  • canonical – a direct or generated pd.DataFrame. see context notes below

  • headers – a list of headers to correlate

  • agg – the name of the aggregation function to enact. The available functions are: ‘sum’, ‘prod’, ‘count’, ‘min’, ‘max’, ‘mean’ and ‘list’ which combines the columns as a list

  • precision – the value precision of the return values

  • seed – (optional) a seed value for the random function: default to None

  • rtn_type – (optional) changes the default return of a ‘list’ to a pd.Series other than the int, float, category, string and object, passing ‘as-is’ will return as is

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if an int, added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level: True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a list of equal length to the one passed

correlate_categories(canonical: ~typing.Any, header: str, correlations: list, actions: dict, default_action: [<class 'str'>, <class 'int'>, <class 'float'>, <class 'dict'>] = None, rtn_type: str = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

Correlates a set of values to an action; the correlations must map to the dictionary index values. Note: to use the current value in the passed values as a parameter value, pass an empty dict {} as the key's value. Likewise, if you want the action value itself to be the current value of the passed value, pass an empty dict as the action.

simple correlation list:

['A', 'B', 'C'] # if values is 'A' then action is 0 and so on

multiple choice correlation

[['A','B'], 'C'] # if values is 'A' OR 'B' then action is 0 and so on

actions dictionary where the method is a class method followed by its parameters

{0: {'method': 'get_numbers', 'from_value': 0, 'to_value': 27}}

you can also use the action to specify a specific value:

{0: 'F', 1: {'method': 'get_numbers', 'from_value': 0, 'to_value': 27}}
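
The mapping from correlation index to action can be sketched as below. This is an illustration under stated assumptions, not the library's code: `correlate_by_category` is a hypothetical helper, and the ‘get_numbers’ action is imitated with an inclusive random integer, which may differ from the real intent method.

```python
import random


def correlate_by_category(values: list, correlations: list, actions: dict,
                          default_action=None, seed: int = None) -> list:
    """Map each value to the action whose index matches its correlation."""
    rng = random.Random(seed)
    result = []
    for value in values:
        action = default_action
        for idx, corr in enumerate(correlations):
            # a nested list means "any of these categories"
            options = corr if isinstance(corr, list) else [corr]
            if value in options:
                action = actions.get(idx, default_action)
                break
        if isinstance(action, dict) and action.get('method') == 'get_numbers':
            # hypothetical stand-in for an intent method call
            action = rng.randint(action['from_value'], action['to_value'])
        result.append(action)
    return result


outcome = correlate_by_category(
    values=['A', 'B', 'C', 'Z'],
    correlations=[['A', 'B'], 'C'],
    actions={0: 'F', 1: {'method': 'get_numbers', 'from_value': 0, 'to_value': 27}},
    default_action='U',
    seed=31,
)
```

Here ‘A’ and ‘B’ both resolve to action 0, ‘C’ triggers the number generator, and the unmatched ‘Z’ falls back to the default action.
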
Parameters:
  • canonical – a direct or generated pd.DataFrame. see context notes below

  • header – the header in the DataFrame to correlate

  • correlations – a list of categories (can also contain lists for multiple correlations)

  • actions – the correlated set of categories that should map to the index

  • default_action – (optional) a default action to take if the selection is not fulfilled

  • seed – a seed value for the random function: default to None

  • rtn_type – (optional) changes the default return of a ‘list’ to a pd.Series other than the int, float, category, string and object, passing ‘as-is’ will return as is

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if an int, added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level: True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a list of equal length to the one passed

Actions are the resulting outcome of the selection (or the default). An action can be just a value or a dict that executes an intent method such as get_number(). To help build actions there is a helper function called action2dict(…) that takes a method as a mandatory attribute.

With actions there are special keyword ‘method’ values:
  • @header: use a column as the value reference, expects the ‘header’ key

  • @constant: use a value constant, expects the key ‘value’

  • @sample: use to get sample values, expected ‘name’ of the Sample method, optional ‘shuffle’ boolean

  • @eval: evaluate a code string, expects the key ‘code_str’ and any locals() required

An example of a simple action to return a selection from a list:

{'method': 'get_category', 'selection': ['M', 'F', 'U']}

an example of using the helper method, in this example we use the keyword @header to get a value from another column at the same index position:

inst.action2dict(method="@header", header='value')

We can even execute some sort of evaluation at run time:

inst.action2dict(method="@eval", code_str='sum(values)', values=[1,4,2,1])
correlate_custom(canonical: ~typing.Any, code_str: str, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs)

Commonly used for custom list comprehension; takes a code string that, when evaluated, returns a list of values. Before using this method, consider the method correlate_selection(…)

When referencing the canonical in the code_str, it should be referenced either by using the parameter label ‘canonical’ or the shortcut ‘@’ symbol, for example:

code_str = "[x + 2 for x in @['A']]" # where 'A' is a header in the canonical

kwargs can also be passed into the code string but must be preceded by a ‘$’ symbol, for example:

code_str = "[True if x == $v1 else False for x in @['A']]" # where 'v1' is a kwargs
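
One way to picture the ‘@’ and ‘$’ conventions is as text substitution before evaluation. The sketch below is an assumption about the mechanism, not the library's implementation: `run_code_str` is a hypothetical helper and a dict of lists stands in for the canonical DataFrame.

```python
def run_code_str(canonical: dict, code_str: str, **kwargs) -> list:
    """Hypothetical expansion of '@' and '$name' before evaluating the string."""
    expanded = code_str.replace('@', 'canonical')
    # replace longer names first so '$v10' is not clobbered by '$v1'
    for name in sorted(kwargs, key=len, reverse=True):
        expanded = expanded.replace(f'${name}', repr(kwargs[name]))
    # eval runs arbitrary code, so only use it with trusted strings
    return eval(expanded, {'canonical': canonical})


reference = {'A': [1, 2, 3]}
added = run_code_str(reference, "[x + 2 for x in @['A']]")
matched = run_code_str(reference, "[True if x == $v1 else False for x in @['A']]", v1=2)
```

Because the code string is evaluated, it should only ever come from a trusted source.
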
Parameters:
  • canonical – a pd.DataFrame as the reference dataframe

  • code_str – an action on those column values. to reference the canonical use ‘@’

  • seed – (optional) a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if an int, added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level: True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

  • kwargs – a set of kwargs to include in any executable function

Returns:

value set based on the selection list and the action

correlate_date_diff(canonical: ~typing.Any, first_date: str, second_date: str, units: str = None, precision: int = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs)

returns a column for the difference between a primary and secondary date, where the primary is an earlier date than the secondary.
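
As a minimal sketch of the idea, the difference in days between two date columns can be computed with the standard library. The helper name `date_diff_days` and the ISO-format string inputs are assumptions for this example; only the default ‘D’ unit is illustrated.

```python
from datetime import datetime


def date_diff_days(first_dates: list, second_dates: list, precision: int = 0) -> list:
    """Difference in days between each older (first) and newer (second) date."""
    result = []
    for first, second in zip(first_dates, second_dates):
        delta = datetime.fromisoformat(second) - datetime.fromisoformat(first)
        days = delta.total_seconds() / 86400  # seconds in a day
        result.append(round(days, precision))
    return result


diffs = date_diff_days(['2023-01-01', '2023-03-01T12:00:00'],
                       ['2023-01-31', '2023-03-03T00:00:00'], precision=1)
```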

Parameters:
  • canonical – the DataFrame containing the column headers

  • first_date – the primary or older date field

  • second_date – the secondary or newer date field

  • units – (optional) The Timedelta units e.g. ‘D’, ‘W’, ‘M’, ‘Y’. default is ‘D’

  • precision – the precision of the result

  • seed – (optional) a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if an int, added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level: True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

  • kwargs – a set of kwargs to include in any executable function

Returns:

value set based on the selection list and the action

correlate_dates(canonical: ~typing.Any, header: str, choice: [<class 'int'>, <class 'float'>, <class 'str'>] = None, choice_header: str = None, offset: [<class 'int'>, <class 'dict'>, <class 'str'>] = None, jitter: [<class 'int'>, <class 'str'>] = None, jitter_units: str = None, ignore_time: bool = None, ignore_seconds: bool = None, min_date: str = None, max_date: str = None, now_delta: str = None, date_format: str = None, day_first: bool = None, year_first: bool = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

correlate a list of continuous dates adjusting those dates, or a subset of those dates, with a normalised jitter along with a value offset. choice, jitter and offset can accept environment variable string names starting with ${ and ending with }.

When using offset and a dict is passed, the key determines the behaviour. A plural unit, such as {‘days’: 1}, adds to the current value (here, adding 1 day). A singular unit, such as {‘hour’: 3}, replaces the current value (here, setting the hour to 3). Offsets can be ‘years’, ‘months’, ‘weeks’, ‘days’, ‘hours’, ‘minutes’ or ‘seconds’. If an int is passed, days are assumed.
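
The plural versus singular distinction can be sketched with the standard library, where a plural key shifts the date and a singular key overwrites one component. `apply_offset` is a hypothetical helper for this example, and ‘years’ and ‘months’ are left out because `timedelta` does not support them.

```python
from datetime import datetime, timedelta

# units that timedelta accepts directly; 'years' and 'months' are omitted in this sketch
PLURAL_UNITS = {'weeks', 'days', 'hours', 'minutes', 'seconds'}
SINGULAR_UNITS = {'year', 'month', 'day', 'hour', 'minute', 'second'}


def apply_offset(date: datetime, offset) -> datetime:
    """Plural keys shift the date, singular keys overwrite that component."""
    if isinstance(offset, int):
        offset = {'days': offset}  # an int is assumed to mean days
    for unit, amount in offset.items():
        if unit in PLURAL_UNITS:
            date = date + timedelta(**{unit: amount})
        elif unit in SINGULAR_UNITS:
            date = date.replace(**{unit: amount})
        else:
            raise ValueError(f"unsupported offset unit '{unit}'")
    return date


start = datetime(2023, 5, 17, 9, 30)
shifted = apply_offset(start, {'days': 1})    # one day later, same time
replaced = apply_offset(start, {'hour': 3})   # same day, hour set to 3
```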

Parameters:
  • canonical – a pd.DataFrame as the reference dataframe

  • header – the header in the DataFrame to correlate

  • choice – (optional) The number of values or percentage between 0 and 1 to choose.

  • choice_header – (optional) those not chosen are given the values of the given header

  • offset – (optional) Temporal parameter that adds to or replaces the offset value. If an int, ‘days’ is assumed

  • jitter – (optional) the random jitter or deviation in days

  • jitter_units – (optional) the units of the jitter, Options: ‘W’, ‘D’, ‘h’, ‘m’, ‘s’. default ‘D’

  • ignore_time – ignore time elements and only select from Year, Month, Day elements. Default is False

  • ignore_seconds – ignore second elements and only select from Year to minute elements. Default is False

  • min_date – (optional) a minimum date not to go below

  • max_date – (optional) a max date not to go above

  • now_delta – (optional) returns a delta from now as an int list, Options: ‘Y’, ‘M’, ‘W’, ‘D’, ‘h’, ‘m’, ‘s’

  • day_first – (optional) if the dates given are in day-first format. Default to True

  • year_first – (optional) if the dates given are year first. Default to False

  • date_format – (optional) the format of the output

  • seed – (optional) a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if an int, added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level: True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a list of equal size to that given

correlate_discrete_intervals(canonical: ~typing.Any, header: str, granularity: [<class 'int'>, <class 'float'>, <class 'list'>] = None, lower: [<class 'int'>, <class 'float'>] = None, upper: [<class 'int'>, <class 'float'>] = None, categories: list = None, precision: int = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

converts continuous representation into discrete representation through interval categorisation
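
For the simplest case, an integer granularity, interval categorisation amounts to equal-width binning. The sketch below assumes that interpretation; `to_intervals` is a hypothetical helper and does not cover the float, tuple-list or percentile forms of granularity.

```python
def to_intervals(values: list, granularity: int = 3, lower: float = None,
                 upper: float = None, categories: list = None) -> list:
    """Assign each value to one of `granularity` equal-width intervals."""
    lower = min(values) if lower is None else lower
    upper = max(values) if upper is None else upper
    width = (upper - lower) / granularity
    labels = categories or list(range(granularity))
    result = []
    for value in values:
        # clamp so that the upper boundary falls into the last interval
        idx = min(int((value - lower) / width), granularity - 1)
        result.append(labels[max(idx, 0)])
    return result


ages = [18, 25, 33, 41, 58, 60]
bands = to_intervals(ages, granularity=3, categories=['low', 'mid', 'high'])
```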

Parameters:
  • canonical – a pd.DataFrame as the reference dataframe

  • header – the header in the DataFrame to correlate

  • granularity – (optional) the granularity of the analysis across the range. Default is 3. If an int is passed, it represents the number of periods; if a float, the length of each interval; if a list[tuple], specific interval periods e.g []; if a list[float], the percentiles or quantiles, all of which should fall between 0 and 1

  • lower – (optional) the lower limit of the number value. Default min()

  • upper – (optional) the upper limit of the number value. Default max()

  • precision – (optional) The precision of the range and boundary values. by default set to 5.

  • categories – (optional) a set of labels the same length as the intervals to name the categories

  • seed – (optional) a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if an int, added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level: True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a list of equal length to the one passed

correlate_join(canonical: ~typing.Any, header: str, action: [<class 'str'>, <class 'dict'>], sep: str = None, rtn_type: str = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

correlate a column and join it with the result of the action. This allows composite values to be built. An example might be to take a forename and add the surname with a space separator to create a composite name field, or to join two primary keys to create a single composite key.
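
The forename and surname example can be sketched as a simple element-wise join. `join_columns` is a hypothetical helper for illustration; in the method itself the second set of values would come from the action rather than from a second list.

```python
def join_columns(left: list, right: list, sep: str = '') -> list:
    """Join values at the same index position with a separator."""
    return [f"{a}{sep}{b}" for a, b in zip(left, right)]


forenames = ['Ada', 'Alan']
surnames = ['Lovelace', 'Turing']
full_names = join_columns(forenames, surnames, sep=' ')
composite_keys = join_columns([101, 102], ['X', 'Y'], sep='-')
```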

Parameters:
  • canonical – a direct or generated pd.DataFrame. see context notes below

  • header – an ordered list of columns to join

  • action – (optional) a string or a single action whose outcome will be joined to the header value

  • sep – (optional) a separator between the values

  • seed – (optional) a seed value for the random function: default to None

  • rtn_type – (optional) changes the default return of a ‘list’ to a pd.Series - other than the int, float, category, string and object, passing ‘as-is’ will return as is

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if an int, added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level: True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a list of equal length to the one passed

Actions are the resulting outcome of the selection (or the default). An action can be just a value or a dict that executes an intent method such as get_number(). To help build actions there is a helper function called action2dict(…) that takes a method as a mandatory attribute.

With actions there are special keyword ‘method’ values:
  • @header: use a column as the value reference, expects the ‘header’ key

  • @constant: use a value constant, expects the key ‘value’

  • @sample: use to get sample values, expected ‘name’ of the Sample method, optional ‘shuffle’ boolean

  • @eval: evaluate a code string, expects the key ‘code_str’ and any locals() required

An example of a simple action to return a selection from a list:

{'method': 'get_category', 'selection': ['M', 'F', 'U']}

an example of using the helper method, in this example we use the keyword @header to get a value from another column at the same index position:

inst.action2dict(method="@header", header='value')

We can even execute some sort of evaluation at run time:

inst.action2dict(method="@eval", code_str='sum(values)', values=[1,4,2,1])
correlate_list_element(canonical: ~typing.Any, header: str, list_size: int = None, random_choice: bool = None, replace: bool = None, shuffle: bool = None, convert_str: bool = None, rtn_type: str = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

correlate a column where each element of the column contains a list, and a choice is taken from that list. If the list_size == 1 then a single value is correlated, otherwise a list is correlated.

Null values are passed through but all other elements must be a list with at least 1 value in it.

If ‘random_choice’ is True then all returned values will be a random selection from the list and of equal length. If ‘random_choice’ is False then each list will not exceed the ‘list_size’.

Also, if ‘random_choice’ is True and ‘replace’ is False then all lists must have more elements than the list_size. By default ‘replace’ is True and ‘shuffle’ is False.

In addition, ‘convert_str’ allows lists that have been formatted as a string to be converted from a string to a list using ‘ast.literal_eval(x)’
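
The behaviour described above can be sketched as follows. This is an illustration under assumptions rather than the library's code: `pick_elements` is a hypothetical helper, None stands in for null values, and only sampling with replacement is shown.

```python
import ast
import random


def pick_elements(column: list, list_size: int = 1, random_choice: bool = False,
                  convert_str: bool = False, seed: int = None) -> list:
    """Take one element, or a list of elements, from each list in the column."""
    rng = random.Random(seed)
    result = []
    for item in column:
        if item is None:
            result.append(None)  # null values pass through untouched
            continue
        if convert_str and isinstance(item, str):
            item = ast.literal_eval(item)  # "['a', 'b']" -> ['a', 'b']
        if random_choice:
            chosen = rng.choices(item, k=list_size)  # sampling with replacement
        else:
            chosen = list(item[:list_size])  # never exceeds list_size
        result.append(chosen[0] if list_size == 1 else chosen)
    return result


column = ["['a', 'b', 'c']", None, "['x']"]
firsts = pick_elements(column, list_size=1, convert_str=True)
pairs = pick_elements(column, list_size=2, random_choice=True, convert_str=True, seed=7)
```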

Parameters:
  • canonical – a direct or generated pd.DataFrame. see context notes below

  • header – The header containing a list to choose from.

  • list_size – (optional) the number of elements to return, if more than 1 then list

  • random_choice – (optional) if the choice should be a random choice.

  • replace – (optional) if the choice selection should be replaced or selected only once

  • shuffle – (optional) if the final list should be shuffled

  • convert_str – if the header has the list as a string convert to list using ast.literal_eval()

  • seed – (optional) a seed value for the random function: default to None

  • rtn_type – (optional) changes the default return of a ‘list’ to a pd.Series other than the int, float, category, string and object, passing ‘as-is’ will return as is

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if an int, added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level: True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a list of equal length to the one passed

correlate_mark_outliers(canonical: ~typing.Any, header: str, measure: [<class 'int'>, <class 'float'>] = None, method: str = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

Drops rows in the canonical where the values are deemed outliers based on the method and measure. The selectable methods are ‘interquartile’ and ‘empirical’, of which ‘interquartile’ is the default.

The ‘empirical’ rule states that for a normal distribution, nearly all of the data will fall within three standard deviations of the mean. Given mu and sigma, a simple way to identify outliers is to compute a z-score for every value, which is defined as the number of standard deviations a value is away from the mean. The measure given should therefore be the z-score. The 68–95–99.7 rule gives the percentage of values that lie within a band around the mean in a normal distribution with a width of two, four and six standard deviations, respectively, and thus guides the choice of z-score.

The ‘interquartile’ range (IQR), also called the midspread, middle 50%, or H‑spread, is a measure of statistical dispersion, being equal to the difference between the 75th and 25th percentiles, or between the upper and lower quartiles of a sample set. The IQR can be used to identify outliers by defining limits on the sample values that are a factor k of the IQR below the 25th percentile or above the 75th percentile. The common value for the factor k is 1.5. A factor k of 3 or more can be used to identify values that are extreme outliers.
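
Both rules can be sketched with the standard library. The helper below is hypothetical and flags outliers with 1 and other values with 0, which is an assumption about the output form; the quartiles use the default method of `statistics.quantiles`, which may differ slightly from the library's percentile calculation.

```python
from statistics import mean, pstdev, quantiles


def mark_outliers(values: list, method: str = 'interquartile', measure: float = None) -> list:
    """Return 1 where a value is an outlier and 0 otherwise."""
    if method == 'empirical':
        z_limit = 3 if measure is None else measure
        mu, sigma = mean(values), pstdev(values)
        return [int(abs(v - mu) / sigma > z_limit) for v in values]
    k = 1.5 if measure is None else measure
    q1, _, q3 = quantiles(values, n=4)  # 25th, 50th and 75th percentiles
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [int(v < low or v > high) for v in values]


readings = [10, 11, 12, 12, 13, 13, 14, 15, 95]
iqr_marks = mark_outliers(readings)
z_marks = mark_outliers(readings, method='empirical', measure=2)
```

In this small sample a single extreme value inflates the standard deviation, so the z-score limit is lowered to 2 for the empirical rule to flag it.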

Parameters:
  • canonical – a pd.DataFrame as the reference dataframe

  • header – the header in the DataFrame to correlate

  • method – (optional) A method to run to identify outliers. interquartile (default) or empirical

  • measure – (optional) A measure against each method, respectively factor k, z-score, quartile (see above)

  • seed – (optional) the random seed

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if an int, added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level: True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

an equal length list of correlated values

correlate_missing(canonical: ~typing.Any, header: str, method: str = None, weights: str = None, constant: ~typing.Any = None, precision: int = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

imputes missing data with statistical estimates of the missing values. The methods are ‘mean’, ‘median’, ‘mode’ and ‘random’ with the addition of ‘constant’ and ‘indicator’

Mean/median imputation consists of replacing all occurrences of missing values (NA) within a variable by the mean (if the variable has a Gaussian distribution) or median (if the variable has a skewed distribution). Can only be applied to numeric values.

Mode imputation consists of replacing all occurrences of missing values (NA) within a variable by the mode, which is the most frequent value or most frequent category. Can be applied to both numerical and categorical variables.

Random sampling imputation is in principle similar to mean, median, and mode imputation in that it assumes missing values should look like those already existing in the distribution. Random sampling consists of taking random observations from the pool of available data and using them to replace the NA. In random sample imputation, we take as many random observations as there are missing values in the variable. Can be applied to both numerical and categorical variables.

Neighbour imputation is for filling in missing values using the k-Nearest Neighbors approach. Each missing feature is imputed using values from five nearest neighbors that have a value for the feature. The feature of the neighbors are averaged uniformly or weighted by distance to each neighbor. If a sample has more than one feature missing, then the neighbors for that sample can be different depending on the particular feature being imputed. When the number of available neighbors is less than five the average for that feature is used during imputation. If there is at least one neighbor with a defined distance, the weighted or unweighted average of the remaining neighbors will be used during imputation.

Constant or Arbitrary value imputation consists of replacing all occurrences of missing values (NA) with an arbitrary constant value. Can be applied to both numerical and categorical variables. A value must be passed in the constant parameter relevant to the column type.

Indicator is not an imputation method, but imputation techniques such as mean, median and random can affect the variable distribution quite dramatically, so it is a good idea to flag the affected values with a missing indicator. This must be done before imputation of the column.
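
The simpler strategies above can be sketched in a few lines. `impute` is a hypothetical helper for illustration, None stands in for NA, and neighbour imputation is left out because it needs more than one feature.

```python
import random
from statistics import mean, median, mode


def impute(values: list, method: str = 'mean', constant=None, seed: int = None) -> list:
    """Replace None entries using the named strategy."""
    known = [v for v in values if v is not None]
    if method == 'indicator':
        return [int(v is None) for v in values]  # 1 flags a missing value
    if method == 'random':
        rng = random.Random(seed)
        return [rng.choice(known) if v is None else v for v in values]
    fill = {'mean': mean, 'median': median, 'mode': mode}
    value = constant if method == 'constant' else fill[method](known)
    return [value if v is None else v for v in values]


ages = [20, None, 30, 30, None, 60]
flags = impute(ages, method='indicator')  # taken before any imputation
by_mean = impute(ages, method='mean')
by_median = impute(ages, method='median')
```

Taking the indicator first preserves the record of which values were originally missing.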

Parameters:
  • canonical – a pd.DataFrame as the reference dataframe

  • header – the header in the DataFrame to correlate

  • method – (optional) the replacement method, ‘mean’, ‘median’, ‘mode’, ‘constant’, ‘random’, ‘indicator’

  • weights – (optional) Weight function used in prediction of nearest neighbour if used as method. Options ‘uniform’ : uniform weights. All points in each neighborhood are weighted equally. ‘distance’ : weight points by the inverse of their distance.

  • constant – (optional) a value to use when the method is constant

  • precision – (optional) if numeric, the precision of the outcome, by default set to 3.

  • seed – (optional) the random seed. defaults to current datetime

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

an equal length list of correlated values

correlate_numbers(canonical: ~typing.Any, header: str, standardize: bool = None, normalize: bool = None, scalar: tuple = None, transform: str = None, precision: int = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

Allows for the scaling transformation of a continuous value set. These scaling techniques alter the values of a variable so that they are expressed on a common scale. This is often done to make it easier to compare different variables or to analyze data.

Parameters:
  • canonical – a pd.DataFrame as the reference dataframe

  • header – the header in the DataFrame to correlate

  • standardize – (optional) standardise continuous variables with mean 0 and std 1

  • normalize – (optional) normalize continuous variables between 0 and 1.

  • scalar – (optional) scales continuous variables between a min and max value passed in the tuple pair.

  • transform – (optional) attempts normal distribution of continuous variables. options are log, sqrt, cbrt, boxcox, yeojohnson

  • precision – (optional) how many decimal places. default to 3

  • seed – (optional) the random seed. defaults to current datetime

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

an equal length list of correlated values
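The standardize, normalize and scalar options listed above correspond to well-known rescaling formulas. The following is a minimal NumPy sketch of those formulas only; the function names are hypothetical and the library's own implementation may differ, for example in how precision is applied.

```python
import numpy as np


def standardize(x: np.ndarray) -> np.ndarray:
    """Rescale to mean 0 and standard deviation 1."""
    return (x - x.mean()) / x.std()


def normalize(x: np.ndarray) -> np.ndarray:
    """Rescale to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())


def scale(x: np.ndarray, bounds: tuple) -> np.ndarray:
    """Rescale to the (min, max) pair given in bounds."""
    lower, upper = bounds
    return normalize(x) * (upper - lower) + lower


values = np.array([2.0, 4.0, 6.0, 8.0])
z = standardize(values)
n = normalize(values)
s = scale(values, (10, 20))
```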

The offset can be a numeric offset that is added to the value, e.g. passing 2 will add 2 to all values. If a string is passed, its format should be a calculation with the ‘@’ character used to represent the column value, e.g.

'1-@' would subtract the column value from 1,
'@*0.5' would multiply the column value by 0.5
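
To make the two offset forms concrete, here is a small sketch of how a numeric or ‘@’ string offset might be applied. The helper `apply_offset` is hypothetical, and using `eval` to interpret the string is an assumption made for illustration; it is not taken from the library.

```python
def apply_offset(values: list, offset) -> list:
    """Apply a numeric offset, or a string calculation where '@' is the value."""
    if isinstance(offset, str):
        # '@' stands for the column value; swap in a variable name and evaluate
        expression = offset.replace("@", "value")
        return [eval(expression, {"__builtins__": {}}, {"value": v}) for v in values]
    # a plain number is simply added to every value
    return [v + offset for v in values]


added = apply_offset([1, 2, 3], 2)
subtracted = apply_offset([0.25, 1.0], "1-@")
halved = apply_offset([1, 2, 3], "@*0.5")
```
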
correlate_polynomial(canonical: ~typing.Any, header: str, coefficient: list, rtn_type: str = None, seed: int = None, keep_zero: bool = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

creates a polynomial from the reference header values by applying the coefficients, where the index of each element in the list is the degree of its term, so the list runs in reverse order from the constant term up to the highest degree.

e.g [6, -2, 0, 4] => f(x) = 4x**3 - 2x + 6
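
The coefficient ordering in the example above can be checked with a few lines of plain Python. This sketch only evaluates the polynomial; the function name `polynomial` is hypothetical and it does not reproduce the method's other options.

```python
def polynomial(x: float, coefficient: list) -> float:
    """Evaluate the polynomial where coefficient[i] multiplies x**i."""
    return sum(c * x ** i for i, c in enumerate(coefficient))


# [6, -2, 0, 4] => f(x) = 4x**3 - 2x + 6
results = [polynomial(x, [6, -2, 0, 4]) for x in (0, 1, 2)]
```

Here f(0) = 6, f(1) = 8 and f(2) = 34, matching 4x**3 - 2x + 6.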

Parameters:
  • canonical – a direct or generated pd.DataFrame. see context notes below

  • header – the header in the DataFrame to correlate

  • coefficient – the reverse list of term coefficients

  • seed – (optional) the random seed. defaults to current datetime

  • rtn_type – (optional) changes the default return of a ‘list’ to a pd.Series of the given type: int, float, category, string or object. Passing ‘as-is’ returns the values as they are

  • keep_zero – (optional) if True then zeros passed remain zero, Default is False

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

an equal length list of correlated values

correlate_selection(canonical: ~typing.Any, selection: list, action: [<class 'str'>, <class 'int'>, <class 'float'>, <class 'dict'>], default_action: [<class 'str'>, <class 'int'>, <class 'float'>, <class 'dict'>] = None, seed: int = None, rtn_type: str = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

returns a value set based on the selection list and the action enacted on that selection. If the selection criteria are not fulfilled then the default_action is taken if specified, otherwise a null value is returned.

If a DataFrame is not passed, the values column is referenced by the header ‘_default’

Parameters:
  • canonical – a direct or generated pd.DataFrame. see context notes below

  • selection – a list of selections where conditions are filtered on, executed in list order

An example of a selection with the minimum requirements is: (see ‘select2dict(…)’)

[{'column': 'genre', 'condition': "=='Comedy'"}]

  • action – a value or dict to act upon if the select is successful. see below for more examples

An example of an action as a dict: (see ‘action2dict(…)’)

{'method': 'get_category', 'selection': ['M', 'F', 'U']}

  • default_action – (optional) a default action to take if the selection is not fulfilled

  • seed – (optional) a seed value for the random function: default to None

  • rtn_type – (optional) changes the default return of a ‘list’ to a pd.Series of the given type: int, float, category, string or object. Passing ‘as-is’ returns the values as they are

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

value set based on the selection list and the action

Selections are a list of dictionaries of conditions and optional additional parameters to filter. To help build conditions there is a static helper method called ‘select2dict(…)’ that has parameter options available to build a condition. An example of a condition with the minimum requirements is [{'column': 'genre', 'condition': "=='Comedy'"}]

an example of using the helper method

selection = [inst.select2dict(column='gender', condition="=='M'"),
             inst.select2dict(column='age', condition=">65", logic='XOR')]

Using the ‘select2dict’ method ensures the correct keys are used and the dictionary is properly formed. It also helps with building the logic that is executed in order.
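
To illustrate how a selection list could be evaluated in order, here is a pure-Python sketch that works on plain dictionaries as rows. It is not the library's implementation: the `row_matches` helper, the default logic of ‘AND’ and the supported operator set are all assumptions made for the example.

```python
import operator

# assumed set of logic operators; only 'XOR' appears in the example above
LOGIC = {"AND": operator.and_, "OR": operator.or_, "XOR": operator.xor}


def row_matches(row: dict, selection: list) -> bool:
    """Apply each condition in list order, combining results with its logic."""
    result = None
    for item in selection:
        # evaluate e.g. value =='M' against the row's column value
        hit = bool(eval(f"value {item['condition']}", {"__builtins__": {}},
                        {"value": row[item["column"]]}))
        if result is None:
            result = hit
        else:
            result = LOGIC[item.get("logic", "AND")](result, hit)
    return bool(result)


rows = [{"gender": "M", "age": 70}, {"gender": "M", "age": 30},
        {"gender": "F", "age": 80}, {"gender": "F", "age": 20}]
selection = [{"column": "gender", "condition": "=='M'"},
             {"column": "age", "condition": ">65", "logic": "XOR"}]
matches = [row_matches(row, selection) for row in rows]
# a constant action where the selection is fulfilled, otherwise a null value
labels = ["selected" if hit else None for hit in matches]
```

With XOR, a row is selected when exactly one of the two conditions holds, so the male aged 70 and the female aged 20 are both rejected.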

Actions are the resulting outcome of the selection (or the default). An action can be just a value or a dict that executes an intent method such as get_number(). To help build actions there is a helper function called action2dict(…) that takes a method as a mandatory attribute.

With actions there are special keyword ‘method’ values:
  • @header: use a column as the value reference, expects the ‘header’ key

  • @constant: use a value constant, expects the key ‘value’

  • @sample: use to get sample values, expected ‘name’ of the Sample method, optional ‘shuffle’ boolean

  • @eval: evaluate a code string, expects the key ‘code_str’ and any locals() required

An example of a simple action to return a selection from a list:

{'method': 'get_category', 'selection': ['M', 'F', 'U']}

This same action using the helper method would look like:

inst.action2dict(method='get_category', selection=['M', 'F', 'U'])

an example of using the helper method, in this example we use the keyword @header to get a value from another column at the same index position:

inst.action2dict(method="@header", header='value')

We can even execute some sort of evaluation at run time:

inst.action2dict(method="@eval", code_str='sum(values)', values=[1,4,2,1])
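
The special ‘method’ values described above can be pictured as a small dispatch on the action dictionary. The sketch below is an assumption about how such a dispatch might look, written for illustration only; `resolve_action` is a hypothetical name and ‘@sample’ and intent methods are left out because they depend on the library.

```python
def resolve_action(action, row: dict):
    """Return the value an action yields for one row of data."""
    if not isinstance(action, dict):
        # a plain value is used as-is
        return action
    method = action["method"]
    if method == "@constant":
        return action["value"]
    if method == "@header":
        # take the value from another column at the same index position
        return row[action["header"]]
    if method == "@eval":
        # remaining keys are passed as local variables to the code string
        local_vars = {k: v for k, v in action.items() if k not in ("method", "code_str")}
        return eval(action["code_str"], {}, local_vars)
    raise ValueError(f"unsupported method in this sketch: {method}")


row = {"value": 42}
constant = resolve_action({"method": "@constant", "value": "U"}, row)
from_header = resolve_action({"method": "@header", "header": "value"}, row)
evaluated = resolve_action({"method": "@eval", "code_str": "sum(values)",
                            "values": [1, 4, 2, 1]}, row)
```
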
correlate_values(canonical: ~typing.Any, header: str, choice: [<class 'int'>, <class 'float'>, <class 'str'>] = None, choice_header: str = None, precision: int = None, jitter: [<class 'int'>, <class 'float'>, <class 'str'>] = None, offset: [<class 'int'>, <class 'float'>, <class 'str'>] = None, transform: ~typing.Any = None, lower: [<class 'int'>, <class 'float'>] = None, upper: [<class 'int'>, <class 'float'>] = None, keep_zero: bool = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) list

correlate a list of continuous values adjusting those values, or a subset of those values, with a normalised jitter (std from the value) along with a value offset. choice, jitter and offset can accept environment variable string names starting with ${ and ending with }.

Parameters:
  • canonical – a pd.DataFrame as the reference dataframe

  • header – the header in the DataFrame to correlate

  • choice – (optional) The number of values to choose to apply the change to. Can be an environment variable.

  • choice_header – (optional) those not chosen are given the values of the given header

  • precision – (optional) to what precision the return values should be

  • offset – (optional) a fixed value to offset or if str an operation to perform using @ as the header value.

  • transform – (optional) passing a lambda function to transform the value. e.g. lambda x: (x - 3) / 2

  • jitter – (optional) a perturbation of the value where the jitter is a random normally distributed std

  • precision – (optional) how many decimal places. default to 3

  • seed – (optional) the random seed. defaults to current datetime

  • keep_zero – (optional) if True then zeros passed remain zero despite a change, Default is False

  • lower – a minimum value not to go below

  • upper – a max value not to go above

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

an equal length list of correlated values
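As a rough picture of how jitter, offset, bounds and keep_zero interact, the following standard-library sketch perturbs each value with normally distributed noise. The function `jitter_values` is hypothetical, and the order in which the steps are applied is an assumption; the library may combine them differently.

```python
import random


def jitter_values(values, jitter=0.0, offset=0.0, lower=None, upper=None,
                  keep_zero=False, precision=3, seed=0):
    """Perturb values with Gaussian noise, add an offset and clip to bounds."""
    rng = random.Random(seed)
    result = []
    for v in values:
        if keep_zero and v == 0:
            # zeros remain zero despite any change
            result.append(0)
            continue
        # jitter is used as the standard deviation of the noise
        new = v + rng.gauss(0, jitter) + offset
        if lower is not None:
            new = max(new, lower)
        if upper is not None:
            new = min(new, upper)
        result.append(round(new, precision))
    return result


shifted = jitter_values([1, 2], offset=2)
noisy = jitter_values([0, 10, 20, 30], jitter=2, offset=1,
                      lower=5, upper=25, keep_zero=True, seed=1)
```
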

frame_selection(canonical: ~typing.Any, selection: list = None, choice: int = None, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame

Selects rows and/or columns, changing the shape of the DataFrame. This is always run last in a pipeline. Rows are filtered before the column filter so columns can be referenced even though they might not be included in the final column list.

Parameters:
  • canonical – a direct or generated pd.DataFrame. see context notes below

  • selection – a list of selections where conditions are filtered on, executed in list order

An example of a selection with the minimum requirements is: (see ‘select2dict(…)’)

[{'column': 'genre', 'condition': "=='Comedy'"}]

  • choice – a number of rows to select, randomly selected from the index

  • headers – a list of headers to drop or filter on type

  • drop – to drop or not drop the headers

  • dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’

  • exclude – to exclude or include the dtypes

  • regex – a regular expression to search the headers. example ‘^((?!_amt).)*$’ excludes ‘_amt’ columns

  • re_ignore_case – true if the regex should ignore case. Default is False

  • seed – this is a place holder, here for compatibility across methods

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pd.DataFrame

Selections are a list of dictionaries of conditions and optional additional parameters to filter. To help build conditions there is a static helper method called ‘select2dict(…)’ that has parameter options available to build a condition. An example of a condition with the minimum requirements is [{'column': 'genre', 'condition': "=='Comedy'"}]

an example of using the helper method

selection = [inst.select2dict(column='gender', condition="=='M'"),
             inst.select2dict(column='age', condition=">65", logic='XOR')]

Using the ‘select2dict’ method ensures the correct keys are used and the dictionary is properly formed. It also helps with building the logic that is executed in order.
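
The regex and re_ignore_case parameters can be illustrated with the standard `re` module. This is a sketch of the header filtering idea only; `filter_headers` is a hypothetical helper, and the pattern used is the balanced form of the example pattern, which uses a negative lookahead to exclude any header containing ‘_amt’.

```python
import re


def filter_headers(headers: list, regex: str, ignore_case: bool = False) -> list:
    """Keep only the headers that the regular expression matches."""
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(regex, flags)
    return [h for h in headers if pattern.search(h)]


no_amounts = r"^((?!_amt).)*$"
kept = filter_headers(["price_amt", "name", "tax_amt", "age"], no_amounts)
kept_case_sensitive = filter_headers(["TAX_AMT", "age"], no_amounts)
kept_ignore_case = filter_headers(["TAX_AMT", "age"], no_amounts, ignore_case=True)
```

Without the ignore-case flag the upper-case ‘TAX_AMT’ header is not excluded, which is why the flag matters for mixed-case column names.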

frame_starter(canonical: ~typing.Any, selection: list = None, choice: int = None, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, rename_map: dict = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame

Selects rows and/or columns, changing the shape of the DataFrame. This is always run first in a pipeline. Rows are filtered before columns are filtered so columns can be referenced even though they might not be included in the final column list.

Parameters:
  • canonical – a direct or generated pd.DataFrame. see context notes below

  • choice – a number of rows to select, randomly selected from the index

  • selection – a list of selections where conditions are filtered on, executed in list order

An example of a selection with the minimum requirements is: (see ‘select2dict(…)’)

[{'column': 'genre', 'condition': "=='Comedy'"}]

  • headers – a list of headers to drop or filter on type

  • drop – to drop or not drop the headers

  • dtype – the column types to include or exclude. Default None else int, float, bool, object, ‘number’

  • exclude – to exclude or include the dtypes

  • regex – a regular expression to search the headers. example ‘^((?!_amt).)*$’ excludes ‘_amt’ columns

  • re_ignore_case – true if the regex should ignore case. Default is False

  • seed – this is a place holder, here for compatibility across methods

  • rename_map – a from: to dictionary of headers to rename

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pd.DataFrame

The canonical is a pd.DataFrame, a pd.Series or list, a connector contract str reference or a set of parameter instructions on how to generate a pd.DataFrame. The description of each is:

  • pd.DataFrame -> a deep copy of the pd.DataFrame

  • pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given

  • str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection

  • int -> generates an empty pd.DataFrame with an index size of the int passed.

  • dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame

- model_*(...) -> one of the SyntheticBuilder model methods and parameters
- @empty -> generates an empty pd.DataFrame where size and headers can be passed
    :size sets the index size of the dataframe
    :headers any initial headers for the dataframe
- @generate -> generate a synthetic file from a remote Domain Contract
    :task_name the name of the SyntheticBuilder task to run
    :repo_uri the location of the Domain Product
    :size (optional) a size to generate
    :seed (optional) if a seed should be applied
    :run_book (optional) if specific intent should be run only
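
The first few canonical forms described above can be pictured with a short pandas sketch. The `as_frame` helper is hypothetical and covers only the DataFrame, Series or list, and int cases; the str and dict forms are omitted because they rely on the library's connectors and builder methods.

```python
import pandas as pd


def as_frame(canonical, header: str = "default") -> pd.DataFrame:
    """Turn a DataFrame, Series, list or int into a pd.DataFrame."""
    if isinstance(canonical, pd.DataFrame):
        # a deep copy so the caller's frame is never modified
        return canonical.copy(deep=True)
    if isinstance(canonical, (pd.Series, list)):
        # one column named by the header
        return pd.DataFrame({header: canonical})
    if isinstance(canonical, int):
        # empty frame whose index has the requested size
        return pd.DataFrame(index=range(canonical))
    raise TypeError("canonical form not covered by this sketch")


original = pd.DataFrame({"a": [1, 2]})
copied = as_frame(original)
copied.loc[0, "a"] = 9
from_list = as_frame([1, 2, 3], header="x")
empty = as_frame(4)
```
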

Selections are a list of dictionaries of conditions and optional additional parameters to filter. To help build conditions there is a static helper method called ‘select2dict(…)’ that has parameter options available to build a condition. An example of a condition with the minimum requirements is [{'column': 'genre', 'condition': "=='Comedy'"}]

an example of using the helper method

selection = [inst.select2dict(column='gender', condition="=='M'"),
             inst.select2dict(column='age', condition=">65", logic='XOR')]

Using the ‘select2dict’ method ensures the correct keys are used and the dictionary is properly formed. It also helps with building the logic that is executed in order.

get_category(selection: list, relative_freq: list = None, quantity: float = None, size: int = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) list

returns a category from a list. Of particular note is the at_least parameter that allows you to control the number of times a selection can be chosen.

Parameters:
  • selection – a list of items to select from

  • relative_freq – a weighting pattern that does not have to add to 1

  • quantity – a number between 0 and 1 representing the percentage quantity of the data

  • size – an optional size of the return. default to 1

  • seed – a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

an item or list of items chosen from the list
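The idea of a relative_freq weighting that does not have to add to 1 can be shown with the standard library, since `random.choices` accepts unnormalised weights. This is an illustrative sketch, not the library's implementation, and `pick_category` is a hypothetical name.

```python
import random


def pick_category(selection: list, relative_freq: list = None,
                  size: int = 1, seed: int = None) -> list:
    """Choose items from selection, weighted by relative_freq if given."""
    rng = random.Random(seed)
    # weights are relative, so [5, 4, 1] behaves like [0.5, 0.4, 0.1]
    return rng.choices(selection, weights=relative_freq, k=size)


picked = pick_category(["M", "F", "U"], relative_freq=[5, 4, 1], size=1000, seed=31)
only_first = pick_category(["M", "F", "U"], relative_freq=[1, 0, 0], size=10, seed=1)
```
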

get_datetime(start: ~typing.Any, until: ~typing.Any, relative_freq: list = None, at_most: int = None, ordered: str = None, date_format: str = None, as_num: bool = None, ignore_time: bool = None, ignore_seconds: bool = None, size: int = None, quantity: float = None, seed: int = None, day_first: bool = None, year_first: bool = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) list

returns a random date between two dates and/or times. Weighted patterns can be applied to the overall date range. If a signed ‘int’ type is passed to the start and/or until dates, the inferred date will be the current date time with the integer being the offset from the current date time in ‘days’.

Note: If no patterns are set this will return a uniformly random value between the range boundaries.

Parameters:
  • start – the start boundary of the date range can be str, datetime, pd.datetime, pd.Timestamp or int

  • until – up until boundary of the date range can be str, datetime, pd.datetime, pd.Timestamp or int

  • quantity – the quantity of values that are not null. Number between 0 and 1

  • relative_freq – (optional) A pattern across the whole date range.

  • at_most – the most times a selection should be chosen

  • ordered – order the data ascending or descending; values accepted are ‘asc’ or ‘des’

  • ignore_time – ignore time elements and only select from Year, Month, Day elements. Default is False

  • ignore_seconds – ignore second elements and only select from Year to minute elements. Default is False

  • date_format – the string format of the date to be returned. if not set then pd.Timestamp returned

  • as_num – returns a list of Matplotlib date values as a float. Default is False

  • size – the size of the sample to return. Default to 1

  • seed – a seed value for the random function: default to None

  • year_first – specifies if to parse with the year first - If True parses dates with the year first, e.g. 10/11/12 is parsed as 2010-11-12. - If both day_first and year_first are True, year_first takes precedence (same as dateutil).

  • day_first – specifies if to parse with the day first - If True, parses dates with the day first, eg %d-%m-%Y. - If False default to a preferred preference, normally %m-%d-%Y (but not strict)

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a date or size of dates in the format given.
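The handling of integer boundaries as day offsets from the current date time can be sketched with the standard library. The helper names and the `now` argument, added so the example is reproducible, are hypothetical; weighting patterns, ordering and formatting are not covered.

```python
import random
from datetime import datetime, timedelta


def resolve_bound(bound, now: datetime) -> datetime:
    """Treat a signed int as an offset in days from the current date time."""
    if isinstance(bound, int):
        return now + timedelta(days=bound)
    return bound


def random_datetimes(start, until, size: int = 1, seed: int = None,
                     now: datetime = None) -> list:
    """Draw datetimes uniformly between the start and until boundaries."""
    now = now or datetime.now()
    begin, end = resolve_bound(start, now), resolve_bound(until, now)
    rng = random.Random(seed)
    span = (end - begin).total_seconds()
    return [begin + timedelta(seconds=rng.uniform(0, span)) for _ in range(size)]


fixed_now = datetime(2024, 1, 15)
# one week either side of the reference date
dates = random_datetimes(-7, 7, size=5, seed=11, now=fixed_now)
```
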

get_dist_bernoulli(probability: float, size: int = None, quantity: float = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) list

A Bernoulli discrete random distribution using scipy

Parameters:
  • probability – the probability of occurrence

  • size – the size of the sample

  • quantity – a number between 0 and 1 representing data that isn’t null

  • seed – a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a random number
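A Bernoulli draw returns 1 with the given probability and 0 otherwise. The method above is documented as using scipy; the sketch below swaps in the standard library purely to show the behaviour, and `bernoulli` is a hypothetical name.

```python
import random


def bernoulli(probability: float, size: int = 1, seed: int = None) -> list:
    """Return a list of 0/1 values where 1 occurs with the given probability."""
    rng = random.Random(seed)
    # rng.random() is in [0, 1), so probability 1.0 always yields 1
    return [1 if rng.random() < probability else 0 for _ in range(size)]


always = bernoulli(1.0, size=5, seed=0)
never = bernoulli(0.0, size=5, seed=0)
mixed = bernoulli(0.3, size=100, seed=3)
```
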

get_dist_bounded_normal(mean: float, std: float, lower: float, upper: float, precision: int = None, size: int = None, quantity: float = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) list

A bounded normal continuous random distribution.

Parameters:
  • mean – the mean of the distribution

  • std – the standard deviation

  • lower – the lower limit of the distribution

  • upper – the upper limit of the distribution

  • precision – the precision of the returned number. if None then assumes int value else float

  • size – the size of the sample

  • quantity – a number between 0 and 1 representing data that isn’t null

  • seed – a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a random number
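One simple way to obtain a bounded normal sample is rejection sampling: draw from the normal distribution and discard anything outside the limits. Whether the library uses this technique is not stated here, so treat the sketch and the name `bounded_normal` as assumptions for illustration.

```python
import random


def bounded_normal(mean: float, std: float, lower: float, upper: float,
                   precision: int = None, size: int = 1, seed: int = None) -> list:
    """Draw normal values by rejection so every result lies in [lower, upper]."""
    rng = random.Random(seed)
    result = []
    while len(result) < size:
        draw = rng.gauss(mean, std)
        # keep only draws inside the limits
        if lower <= draw <= upper:
            # precision None rounds to an int, otherwise to that many decimals
            result.append(round(draw, precision))
    return result


sample = bounded_normal(0.0, 1.0, -1.0, 1.0, precision=2, size=50, seed=5)
whole = bounded_normal(50.0, 10.0, 0.0, 100.0, size=5, seed=5)
```

Rejection sampling becomes slow when the limits lie far from the mean, because most draws are discarded.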

get_dist_choice(number: [<class 'int'>, <class 'str'>, <class 'float'>], size: int = None, quantity: float = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) list
Creates a list of latent values of 0 or 1 where 1 is randomly selected based upon the number given. The number parameter can be a direct reference to the canonical column header or to an environment variable. If the environment variable is used, number should be set to "${<<YOUR_ENVIRON>>}" where <<YOUR_ENVIRON>> is the environment variable name.

Parameters:
  • number – The number of true (1) values to randomly choose from the canonical. see below

  • size – the size of the sample. if a tuple of intervals, size must match the tuple

  • quantity – a number between 0 and 1 representing data that isn’t null

  • seed – a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a list of 1 or 0

as choice is a fixed value, number can be represented by an environment variable with the format ‘${NAME}’ where NAME is the environment variable name
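
The ‘${NAME}’ convention and the fixed count of 1 values can be sketched as follows. The helpers are hypothetical, the number is treated as an exact count of ones, and the environment variable is set inside the example only so that it runs on its own.

```python
import os
import random


def resolve_number(number) -> int:
    """Read the count from an environment variable if given as '${NAME}'."""
    if isinstance(number, str) and number.startswith("${") and number.endswith("}"):
        return int(os.environ[number[2:-1]])
    return int(number)


def dist_choice(number, size: int, seed: int = None) -> list:
    """Return a list of 0/1 values with exactly `number` ones at random positions."""
    count = resolve_number(number)
    rng = random.Random(seed)
    ones = set(rng.sample(range(size), k=count))
    return [1 if i in ones else 0 for i in range(size)]


os.environ["YOUR_ENVIRON"] = "3"
from_env = dist_choice("${YOUR_ENVIRON}", size=10, seed=7)
direct = dist_choice(4, size=10, seed=7)
```
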

get_dist_normal(mean: float, std: float, precision: int = None, size: int = None, quantity: float = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) list

A normal (Gaussian) continuous random distribution.

Parameters:
  • mean – The mean (“centre”) of the distribution.

  • std – The standard deviation (jitter or “width”) of the distribution. Must be >= 0

  • precision – The number of decimal points. The default is 3

  • size – the size of the sample. if a tuple of intervals, size must match the tuple

  • quantity – a number between 0 and 1 representing data that isn’t null

  • seed – a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - if -1: added to a level above any current instance of the intent section, level 0 if not found - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a random number

get_distribution(distribution: str, is_stats: bool = None, precision: int = None, size: int = None, quantity: float = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs) list

returns a number based on the distribution type.

Parameters:
  • distribution – The string name of the distribution function from numpy random Generator class

  • is_stats – (optional) if the generator is from the stats package and not numpy

  • precision – (optional) the precision of the returned number

  • size – (optional) the size of the sample

  • quantity – (optional) a number between 0 and 1 representing data that isn’t null

  • seed – (optional) a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or level 0 if not found. If an int, the intent is added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

  • kwargs – the parameters of the method

Returns:

a random number

get_intervals(intervals: list, relative_freq: list = None, precision: int = None, size: int = None, quantity: float = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) list

returns a number based on a list selection of tuple(lower, upper) interval

Parameters:
  • intervals – a list of unique tuple pairs representing the interval lower and upper boundaries

  • relative_freq – a weighting pattern or probability that does not have to add to 1

  • precision – the precision of the returned number. if None then assumes int value else float

  • size – the size of the sample

  • quantity – a number between 0 and 1 representing data that isn’t null

  • seed – a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or level 0 if not found. If an int, the intent is added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a random number
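As a rough illustration of weighted interval selection, the sketch below uses only the Python standard library. It is not the ds_discovery implementation: the helper name pick_from_intervals and its treatment of precision (integers when None, rounded floats otherwise) are assumptions made for the example.

```python
import random

def pick_from_intervals(intervals, relative_freq=None, precision=None, size=1, seed=None):
    """Hypothetical helper: choose an interval by weight, then draw a value inside it."""
    rng = random.Random(seed)
    # random.choices normalises the weights, so they need not add up to 1
    chosen = rng.choices(intervals, weights=relative_freq, k=size)
    result = []
    for lower, upper in chosen:
        if precision is None:
            # assumption: no precision means an integer in [lower, upper)
            result.append(rng.randrange(int(lower), int(upper)))
        else:
            result.append(round(rng.uniform(lower, upper), precision))
    return result

# the first interval is weighted three times as heavily as the second
values = pick_from_intervals([(0, 10), (50, 60)], relative_freq=[3, 1], size=5, seed=42)
print(values)
```

Passing the same seed reproduces the same list, which mirrors the purpose of the seed parameter described above.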

get_number(from_value: [<class 'int'>, <class 'float'>, <class 'str'>] = None, to_value: [<class 'int'>, <class 'float'>, <class 'str'>] = None, relative_freq: list = None, precision: int = None, ordered: str = None, at_most: int = None, size: int = None, quantity: float = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) list

returns a number in the range from_value to to_value. If only to_value is given, from_value is zero

Parameters:
  • from_value – (signed) integer or float to start from. See below for str

  • to_value – (optional) (signed) integer or float the number sequence goes up to but does not include. See below for str

  • relative_freq – a weighting pattern or probability that does not have to add to 1

  • precision – the precision of the returned number. if None then assumes int value else float

  • ordered – order the data ascending or descending. Accepted values are ‘asc’ or ‘des’

  • at_most – the most times a selection should be chosen

  • size – the size of the sample

  • quantity – a number between 0 and 1 representing data that isn’t null

  • seed – a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or level 0 if not found. If an int, the intent is added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a random number

The values can be represented by an environment variable with the format ‘${NAME}’ where NAME is the environment variable name
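The environment variable format can be pictured with the following minimal sketch. The helper resolve_value, the variable name MAX_AGE, and the fallback to a default when the variable is unset are all assumptions for illustration; the library's own resolution logic may differ.

```python
import os
import re

# matches a whole string of the form ${NAME}
_ENV_PATTERN = re.compile(r"^\$\{(\w+)\}$")

def resolve_value(value, default=None):
    """Hypothetical helper: return the environment variable's value when given '${NAME}'."""
    if isinstance(value, str):
        match = _ENV_PATTERN.match(value)
        if match:
            # assumption: fall back to `default` when the variable is not set
            return os.environ.get(match.group(1), default)
    return value

os.environ["MAX_AGE"] = "65"
from_value = resolve_value(18)
# environment values are strings, so convert before using them as numbers
to_value = float(resolve_value("${MAX_AGE}"))
print(from_value, to_value)
```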

get_sample(sample_name: str, sample_size: int = None, shuffle: bool = None, size: int = None, quantity: float = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

returns a sample set based on sector and name. To see the sample sets available, use the Sample class __dir__() method:

> from ds_discovery.sample.sample_data import Sample
> Sample().__dir__()

Parameters:
  • sample_name – The name of the Sample method to be used.

  • sample_size – (optional) the size of the sample to take from the reference file

  • shuffle – (optional) if the selection should be shuffled before selection. Default is true

  • quantity – (optional) a number between 0 and 1 representing the percentage quantity of the data

  • size – (optional) size of the return. default to 1

  • seed – (optional) a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or level 0 if not found. If an int, the intent is added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a sample list

get_selection(select_source: str, column_header: str, relative_freq: list = None, sample_size: int = None, selection_size: int = None, size: int = None, shuffle: bool = None, quantity: float = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) list

returns a random list of values where the selection of those values is taken from a connector source.

Parameters:
  • select_source – the selection source for the reference dataframe

  • column_header – the name of the column header to correlate

  • relative_freq – (optional) a weighting pattern of the final selection

  • selection_size – (optional) the selection to take from the sample size, normally used with shuffle

  • sample_size – (optional) the size of the sample to take from the reference file

  • shuffle – (optional) if the selection should be shuffled before selection. Default is true

  • quantity – (optional) a number between 0 and 1 representing the percentage quantity of the data

  • size – (optional) size of the return. default to 1

  • seed – (optional) a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or level 0 if not found. If an int, the intent is added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

list

The select_source is normally a connector contract str reference or a set of parameter instructions on how to generate a pd.DataFrame, but can also be a pd.DataFrame. The description of each is:

  • pd.DataFrame -> a deep copy of the pd.DataFrame

  • pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given

  • str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection

  • int -> generates an empty pd.DataFrame with an index size of the int passed.

  • dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame
    methods:
    • model_*(…) -> one of the SyntheticBuilder model methods and parameters

    • @empty -> generates an empty pd.DataFrame where size and headers can be passed

      :size sets the index size of the dataframe
      :headers any initial headers for the dataframe

    • @generate -> generate a synthetic file from a remote Domain Contract

      :task_name the name of the SyntheticBuilder task to run
      :repo_uri the location of the Domain Product
      :size (optional) a size to generate
      :seed (optional) if a seed should be applied
      :run_book (optional) if specific intent should be run only

get_string_pattern(pattern: str, choices: dict = None, as_binary: bool = None, quantity: [<class 'float'>, <class 'int'>] = None, size: int = None, choice_only: bool = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) list

Returns a random string based on the pattern given. The pattern is made up from the choices passed but by default is as follows:

  • c = random char [a-z][A-Z]

  • d = digit [0-9]

  • l = lower case char [a-z]

  • U = upper case char [A-Z]

  • p = all punctuation

  • s = space

You can also use punctuation in the pattern, which will be retained. A pattern example might be

UUddsdUU => BA12 2NE or dl-{UU} => 4g-{FY}

To create your own choices, pass a dictionary with a reference char as the key and a list of choices as the value

Parameters:
  • pattern – the pattern to create the string from

  • choices – (optional) a dictionary of lists of choices to replace the default.

  • as_binary – (optional) if the return string is prefixed with a b

  • quantity – (optional) a number between 0 and 1 representing the percentage quantity of the data

  • size – (optional) the size of the return list. if None returns a single value

  • choice_only – (optional) whether to use only the choices given, or to take characters that are not found as they are

  • seed – (optional) a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or level 0 if not found. If an int, the intent is added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a string based on the pattern
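The pattern expansion can be sketched with the standard library as follows. This is an illustration under stated assumptions, not the library's code: expand_pattern and DEFAULT_CHOICES are hypothetical names, the default keys follow the list of reference characters given in the description, and choice_only=True is assumed to drop characters that have no choice list.

```python
import random
import string

# default reference characters, following the list in the description
DEFAULT_CHOICES = {
    "c": list(string.ascii_letters),
    "d": list(string.digits),
    "l": list(string.ascii_lowercase),
    "U": list(string.ascii_uppercase),
    "p": list(string.punctuation),
    "s": [" "],
}

def expand_pattern(pattern, choices=None, choice_only=False, seed=None):
    """Hypothetical helper: build one string by replacing each reference char."""
    rng = random.Random(seed)
    choices = DEFAULT_CHOICES if choices is None else choices
    out = []
    for char in pattern:
        if char in choices:
            out.append(rng.choice(choices[char]))
        elif not choice_only:
            # characters without a choice list are kept as they are
            out.append(char)
    return "".join(out)

print(expand_pattern("UUddsdUU", seed=7))
# custom choices: 'x' and 'y' become reference chars, '-' is retained
print(expand_pattern("xx-y", choices={"x": ["A", "B"], "y": ["1"]}, seed=7))
```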

get_tagged_pattern(pattern: [<class 'str'>, <class 'list'>], tags: dict, relative_freq: list = None, size: int = None, quantity: [<class 'float'>, <class 'int'>] = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) list

Returns the pattern with the tags substituted by the tag choice. Example tag dictionary:

{ '<slogan>': {'action': '', 'kwargs': {}},
  '<phone>': {'action': '', 'kwargs': {}}
}

where action is a method name and kwargs are the arguments to pass to it. For sample data, use the corresponding sample method as the action.

Parameters:
  • pattern – a string or list of strings to apply the tag substitution to

  • tags – a dictionary of tags and actions

  • relative_freq – a weighting pattern that does not have to add to 1

  • quantity – a number between 0 and 1 representing the percentage quantity of the data

  • size – an optional size of the return. default to 1

  • seed – a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or level 0 if not found. If an int, the intent is added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a list of patterns with tags replaced
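The tag substitution idea can be sketched as below. Everything named here is hypothetical: fake_phone and fake_slogan stand in for whatever methods an action name would refer to, and the ACTIONS lookup is an assumption about how a name might be resolved to a callable.

```python
import random

# hypothetical stand-ins for the methods an 'action' name would refer to
def fake_phone(rng, length=6):
    return "".join(rng.choice("0123456789") for _ in range(length))

def fake_slogan(rng, options=("Think big", "Keep it simple")):
    return rng.choice(options)

ACTIONS = {"fake_phone": fake_phone, "fake_slogan": fake_slogan}

def fill_tags(pattern, tags, seed=None):
    """Hypothetical helper: replace each tag with the result of its action."""
    rng = random.Random(seed)
    result = pattern
    for tag, spec in tags.items():
        action = ACTIONS[spec["action"]]
        while tag in result:
            # replace one occurrence at a time so each gets its own value
            value = action(rng, **spec.get("kwargs", {}))
            result = result.replace(tag, str(value), 1)
    return result

tags = {
    "<slogan>": {"action": "fake_slogan", "kwargs": {}},
    "<phone>": {"action": "fake_phone", "kwargs": {"length": 4}},
}
print(fill_tags("<slogan>! Call <phone> or <phone>", tags, seed=3))
```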

get_uuid(version: int = None, as_hex: bool = None, size: int = None, quantity: float = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs) list

returns a list of UUIDs based on the version presented. By default the UUID version is 4. Optional parameters for the chosen version's UUID generator can be passed as kwargs.

Version 1: Generate a UUID from a host ID, sequence number, and the current time. Note that as uuid1 contains the computer's network address, it may compromise privacy
  • param node: (optional) used instead of getnode() which returns a hardware address

  • param clock_seq: (optional) used as a sequence number alternative

Version 3: Generate a UUID based on the MD5 hash of a namespace identifier and a name
  • param namespace: an alternative namespace as a UUID e.g. uuid.NAMESPACE_DNS

  • param name: a string name

Version 4: Generate a random UUID

Version 5: Generate a UUID based on the SHA-1 hash of a namespace identifier and name
  • param namespace: an alternative namespace as a UUID e.g. uuid.NAMESPACE_DNS

  • param name: a string name

Parameters:
  • version – The version of the UUID to use. 1, 3, 4 or 5

  • as_hex – if the return value is in hex format, else as a string

  • size – the size of the sample. Must be smaller than the range

  • quantity – a number between 0 and 1 representing the percentage quantity of the data

  • seed – a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or level 0 if not found. If an int, the intent is added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a unique identifier randomly selected from the range
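The versions listed above correspond to generators in Python's standard uuid module. The sketch below shows how a version number and kwargs could be dispatched to them; make_uuid is a hypothetical helper, not the library's method, and the name "example.com" is only an illustrative value.

```python
import uuid

def make_uuid(version=4, as_hex=False, **kwargs):
    """Hypothetical helper: dispatch to the standard library generator for the version."""
    generators = {1: uuid.uuid1, 3: uuid.uuid3, 4: uuid.uuid4, 5: uuid.uuid5}
    if version not in generators:
        raise ValueError("version must be 1, 3, 4 or 5")
    value = generators[version](**kwargs)
    # .hex drops the hyphens; str() keeps the canonical 8-4-4-4-12 form
    return value.hex if as_hex else str(value)

random_id = make_uuid()
# versions 3 and 5 are deterministic for a given namespace and name
named_id = make_uuid(5, namespace=uuid.NAMESPACE_DNS, name="example.com")
hex_id = make_uuid(4, as_hex=True)
print(random_id, named_id, hex_id)
```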

model_analysis(canonical: ~typing.Any, other: ~typing.Any, columns_list: list = None, exclude_associate: list = None, detail_numeric: bool = None, strict_typing: bool = None, category_limit: int = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame

builds a set of columns based on another DataFrame (see analyse_association). If a reference DataFrame is passed then, as the analysis is run, if the column already exists the row value is taken as the reference to the sub category rather than a random value. This allows an already constructed association to be used as the reference for a sub category.

Parameters:
  • canonical – a pd.DataFrame as the reference dataframe

  • other – a direct or generated pd.DataFrame. see context notes below

  • columns_list – (optional) a list structure of columns to select for association

  • exclude_associate – (optional) a list of dot separated tree of items to exclude from iteration (e.g. [‘age.gender.salary’])

  • detail_numeric – (optional) as a default, if numeric columns should have detail stats, slowing analysis

  • strict_typing – (optional) stops objects and string types being seen as categories

  • category_limit – (optional) a global cap on categories captured. zero value returns no limits

  • seed – (optional) a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or level 0 if not found. If an int, the intent is added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a DataFrame

The other is a pd.DataFrame, a pd.Series, int or list, a connector contract str reference or a set of parameter instructions on how to generate a pd.DataFrame. The description of each is:

  • pd.DataFrame -> a deep copy of the pd.DataFrame

  • pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given

  • str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection

  • int -> generates an empty pd.DataFrame with an index size of the int passed.

  • dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame
    methods:
    • model_*(…) -> one of the SyntheticBuilder model methods and parameters

    • @empty -> generates an empty pd.DataFrame where size and headers can be passed

      :size sets the index size of the dataframe
      :headers any initial headers for the dataframe

    • @generate -> generate a synthetic file from a remote Domain Contract

      :task_name the name of the SyntheticBuilder task to run
      :repo_uri the location of the Domain Product
      :size (optional) a size to generate
      :seed (optional) if a seed should be applied
      :run_book (optional) if specific intent should be run only

model_concat(canonical: ~typing.Any, other: ~typing.Any, as_rows: bool = None, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, shuffle: bool = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame

returns the full column values directly from another connector data source.

Parameters:
  • canonical – a direct or generated pd.DataFrame. see context notes below

  • other – a direct or generated pd.DataFrame to concatenate

  • as_rows – (optional) how to concatenate, True adds the connector dataset as rows, False as columns

  • headers – (optional) a filter of headers from the ‘other’ dataset

  • drop – (optional) to drop or not drop the headers if specified

  • dtype – (optional) a filter on data type for the ‘other’ dataset. int, float, bool, object

  • exclude – (optional) to exclude or include the data types if specified

  • regex – (optional) a regular expression to search the headers. example ‘^((?!_amt).)*$’ excludes ‘_amt’

  • re_ignore_case – (optional) true if the regex should ignore case. Default is False

  • shuffle – (optional) if the rows in the loaded canonical should be shuffled

  • seed – this is a place holder, here for compatibility across methods

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or level 0 if not found. If an int, the intent is added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame
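To show how a header regex with a negative lookahead behaves, here is a small standard-library sketch. The helper filter_headers and the sample header names are hypothetical; only the behaviour of Python's re module is relied upon, and how the library applies the expression internally is not confirmed here.

```python
import re

def filter_headers(headers, regex, ignore_case=False):
    """Hypothetical helper: keep the headers that match the regular expression."""
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(regex, flags)
    return [h for h in headers if pattern.search(h)]

headers = ["id", "sales_amt", "tax_AMT", "region"]
# the negative lookahead rejects any header containing '_amt'
case_sensitive = filter_headers(headers, r"^((?!_amt).)*$")
# with case ignored, 'tax_AMT' is rejected as well
case_insensitive = filter_headers(headers, r"^((?!_amt).)*$", ignore_case=True)
print(case_sensitive)
print(case_insensitive)
```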

The other is a pd.DataFrame, a pd.Series or list, a connector contract str reference or a set of parameter instructions on how to generate a pd.DataFrame. The description of each is:

  • pd.DataFrame -> a deep copy of the pd.DataFrame

  • pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given

  • str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection

  • dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame
    methods:
    • model_*(…) -> one of the SyntheticBuilder model methods and parameters

    • @empty -> generates an empty pd.DataFrame where size and headers can be passed

      :size sets the index size of the dataframe
      :headers any initial headers for the dataframe

    • @generate -> generate a synthetic file from a remote Domain Contract

      :task_name the name of the SyntheticBuilder task to run
      :repo_uri the location of the Domain Product
      :size (optional) a size to generate
      :seed (optional) if a seed should be applied
      :run_book (optional) if specific intent should be run only

model_custom(canonical: ~typing.Any, code_str: str, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs)

Commonly used for custom methods. Takes a code string that, when executed, changes the canonical and returns the modified canonical. If the code returns a pd.DataFrame, this is returned; otherwise the assumption is that the canonical has been changed in place, and the modified canonical is returned. When referencing the canonical in the code_str, use either the parameter label ‘canonical’ or the shortcut ‘@’ symbol. kwargs can also be passed into the code string but must be preceded by a ‘$’ symbol

Parameters:
  • canonical – a direct or generated pd.DataFrame. see context notes below

  • code_str – an action on those column values

  • kwargs – a set of kwargs to include in any executable function

  • seed – (optional) a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or level 0 if not found. If an int, the intent is added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a list or pandas.DataFrame

model_dict_column(canonical: ~typing.Any, header: str, convert_str: bool = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame

takes a column that contains dict and expands them into columns. Note, the column must be a flat dictionary. Complex structures will not work.

Parameters:
  • canonical – a pd.DataFrame as the reference dataframe

  • header – the header of the column to be converted

  • convert_str – (optional) if the header has the dict as a string convert to dict using ast.literal_eval()

  • seed – (optional) this is a place holder, here for compatibility across methods

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or level 0 if not found. If an int, the intent is added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame
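The expansion of a flat dictionary column can be pictured without pandas, using a list of row dictionaries. This is a sketch only: expand_dict_column and the sample rows are hypothetical, and dropping the original column after expansion is an assumption. The string conversion uses ast.literal_eval, as mentioned for convert_str.

```python
import ast

def expand_dict_column(rows, header, convert_str=False):
    """Hypothetical helper: rows is a list of dicts; expand rows[i][header] into new keys."""
    expanded = []
    for row in rows:
        value = row[header]
        if convert_str and isinstance(value, str):
            # safely parse the string representation of a flat dict
            value = ast.literal_eval(value)
        # assumption: the original column is dropped once expanded
        new_row = {k: v for k, v in row.items() if k != header}
        new_row.update(value)
        expanded.append(new_row)
    return expanded

rows = [
    {"id": 1, "attrs": "{'colour': 'red', 'size': 3}"},
    {"id": 2, "attrs": "{'colour': 'blue', 'size': 5}"},
]
expanded_rows = expand_dict_column(rows, "attrs", convert_str=True)
print(expanded_rows)
```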

model_difference(canonical: ~typing.Any, other: ~typing.Any, on_key: [<class 'str'>, <class 'list'>], drop_zero_sum: bool = None, summary_connector: bool = None, flagged_connector: str = None, detail_connector: str = None, unmatched_connector: str = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

returns the difference between two canonicals, joined on a common and unique key. The on_key parameter can be a direct reference to the canonical column header or to an environment variable. If the environment variable is used on_key should be set to "${<<YOUR_ENVIRON>>}" where <<YOUR_ENVIRON>> is the environment variable name.

If the flagged connector parameter is used, a report flagging mismatched left data with right data is produced for this connector, where 1 indicates a difference and 0 indicates the values are the same. By default this method returns this report, but if this parameter is set the original canonical is returned instead. This allows a canonical pipeline to continue through the component while outputting the difference report.

If the detail connector parameter is used, a detail report is produced showing the left and right values that differ.

If the unmatched connector parameter is used, the on_key values that don’t match between left and right are reported

Parameters:
  • canonical – a direct or generated pd.DataFrame. see context notes below

  • other – a direct or generated pd.DataFrame to compare against

  • on_key – The name of the key that uniquely joins the canonical to others

  • drop_zero_sum – (optional) drops rows and columns that have a total sum of zero differences

  • summary_connector – (optional) a connector name where the summary report is sent

  • flagged_connector – (optional) a connector name where the differences are flagged

  • detail_connector – (optional) a connector name where the differences are shown

  • unmatched_connector – (optional) a connector name where the unmatched keys are shown

  • seed – (optional) this is a placeholder, here for compatibility across methods

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or level 0 if not found. If an int, the intent is added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame

The other is a pd.DataFrame, a pd.Series or list, a connector contract str reference or a set of parameter instructions on how to generate a pd.DataFrame. The description of each is:

  • pd.DataFrame -> a deep copy of the pd.DataFrame

  • pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given

  • str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection

  • dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame
    methods:
    • model_*(…) -> one of the SyntheticBuilder model methods and parameters

    • @empty -> generates an empty pd.DataFrame where size and headers can be passed

      :size sets the index size of the dataframe
      :headers any initial headers for the dataframe

    • @generate -> generate a synthetic file from a remote Domain Contract

      :task_name the name of the SyntheticBuilder task to run
      :repo_uri the location of the Domain Product
      :size (optional) a size to generate
      :seed (optional) if a seed should be applied
      :run_book (optional) if specific intent should be run only
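For model_difference, the flagged and unmatched reports can be pictured with plain Python rows joined on a unique key. This is a sketch under assumptions: flag_differences and the sample data are hypothetical, and the real method works on DataFrames and connectors rather than lists of dictionaries.

```python
def flag_differences(left, right, on_key):
    """Hypothetical helper: left/right are lists of dicts sharing a unique key."""
    right_by_key = {row[on_key]: row for row in right}
    flagged, unmatched = [], []
    for row in left:
        key = row[on_key]
        other = right_by_key.get(key)
        if other is None:
            # key exists on the left only
            unmatched.append(key)
            continue
        # 1 marks a difference, 0 marks identical values
        flags = {col: int(row[col] != other.get(col)) for col in row if col != on_key}
        flagged.append({on_key: key, **flags})
    left_keys = {row[on_key] for row in left}
    # keys that exist on the right only
    unmatched.extend(k for k in right_by_key if k not in left_keys)
    return flagged, unmatched

left = [
    {"id": 1, "age": 30, "city": "Leeds"},
    {"id": 2, "age": 41, "city": "York"},
    {"id": 3, "age": 25, "city": "Bath"},
]
right = [
    {"id": 1, "age": 30, "city": "Leeds"},
    {"id": 2, "age": 42, "city": "York"},
    {"id": 4, "age": 50, "city": "Hull"},
]
flagged, unmatched = flag_differences(left, right, "id")
print(flagged)
print(unmatched)
```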

model_drop_outliers(canonical: ~typing.Any, header: str, measure: [<class 'int'>, <class 'float'>] = None, method: str = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

Drops rows in the canonical where the values are deemed outliers based on the method and measure. The selectable methods are interquartile or empirical, of which interquartile is the default.

The ‘empirical’ rule states that for a normal distribution, nearly all of the data will fall within three standard deviations of the mean. Given mu and sigma, a simple way to identify outliers is to compute a z-score for every value, defined as the number of standard deviations a value is away from the mean. The measure given should therefore be the z-score. The 68–95–99.7 rule gives the percentage of values that lie within a band around the mean in a normal distribution with a width of two, four and six standard deviations, respectively, and thus guides the choice of z-score.

The ‘interquartile’ range (IQR), also called the midspread, middle 50%, or H‑spread, is a measure of statistical dispersion, equal to the difference between the 75th and 25th percentiles, or between the upper and lower quartiles of a sample set. The IQR can be used to identify outliers by defining limits on the sample values that are a factor k of the IQR below the 25th percentile or above the 75th percentile. The common value for the factor k is 1.5. A factor k of 3 or more can be used to identify values that are extreme outliers.

Parameters:
  • canonical – a pd.DataFrame as the reference dataframe

  • header – the header in the DataFrame to check for outliers

  • method – (optional) a method to run to identify outliers. ‘interquartile’ (default) or ‘empirical’

  • measure – (optional) a measure against each method, respectively factor k, z-score, quartile (see above)

  • seed – (optional) the random seed

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if int, added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level: True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

an equal length list of correlated values
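
As a rough illustration of the two rules described above, the following sketch filters a column using plain pandas. It is not the library's implementation: the helper name drop_outliers_sketch and its fallback values for measure (k = 1.5 and z = 3) are assumptions made for the example.

```python
import pandas as pd


def drop_outliers_sketch(df: pd.DataFrame, header: str,
                         method: str = "interquartile",
                         measure: float = None) -> pd.DataFrame:
    """Hypothetical helper: keep only rows whose `header` value is not an outlier."""
    values = df[header]
    if method == "interquartile":
        k = 1.5 if measure is None else measure  # assumed default factor k
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        # Keep values inside [Q1 - k*IQR, Q3 + k*IQR].
        keep = values.between(q1 - k * iqr, q3 + k * iqr)
    elif method == "empirical":
        z = 3.0 if measure is None else measure  # assumed default z-score
        z_scores = (values - values.mean()) / values.std()
        keep = z_scores.abs() <= z
    else:
        raise ValueError(f"unknown method: {method}")
    return df[keep]


df = pd.DataFrame({"x": [10, 11, 12, 13, 14, 15, 100]})
iqr_result = drop_outliers_sketch(df, "x")
emp_result = drop_outliers_sketch(df, "x", method="empirical", measure=2)
```

With only seven values the single extreme point inflates the standard deviation, so the empirical call passes a z-score of 2 rather than 3; on small samples the interquartile rule is usually the more robust of the two.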

model_encode_count(canonical: ~typing.Any, headers: [<class 'str'>, <class 'list'>], prefix=None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame

Encodes categorical data types. Count encoding replaces each category with the count of the observations that show that category in the dataset. This technique captures the representation of each label in a dataset, but the encoding may not necessarily be predictive of the outcome.

Parameters:
  • canonical – a pd.DataFrame as the reference dataframe

  • headers – the header(s) to apply the encoding

  • prefix – a str to prefix the column

  • seed – (optional) a seed value for the random function; defaults to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if int, added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level: True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame
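
The idea of count encoding can be sketched with plain pandas as follows. This is only an illustration of the technique, not the library's code; the column names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "blue", "red", "green", "red", "blue"]})

# Count how often each category occurs, then map every value to its count.
counts = df["colour"].value_counts()
df["colour_count"] = df["colour"].map(counts)
```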

model_encode_integer(canonical: ~typing.Any, headers: [<class 'str'>, <class 'list'>], ranking: list = None, prefix=None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

Integer encoding replaces the categories by digits from 1 to n, where n is the number of distinct categories of the variable. Integer encoding can be either nominal or ordinal.

Nominal data is categorical variables without any particular order between categories. This means that the categories cannot be sorted and there is no natural order between them.

Ordinal data represents categories with a natural, ordered relationship between each category. This means that the categories can be sorted in either ascending or descending order. In order to encode integers as ordinal, a ranking must be provided.

If ranking is given, the return will be ordinal values based on the ranking order of the list. If a categorical value is not found in the list it is grouped with other missing values and given the last ranking.

Parameters:
  • canonical – a pd.DataFrame as the reference dataframe

  • headers – the header(s) to apply the encoding

  • ranking – (optional) if used, ranks the categorical values to the list given

  • prefix – a str to prefix the column

  • seed – (optional) a seed value for the random function; defaults to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if int, added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level: True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame
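
The difference between nominal and ranked (ordinal) integer encoding can be sketched in pure Python. The helper encode_integer_sketch is hypothetical, and numbering nominal categories by first appearance is an arbitrary choice made for the example.

```python
def encode_integer_sketch(values, ranking=None):
    """Hypothetical helper: map categories to integers starting at 1."""
    if ranking is None:
        # Nominal: number the categories in order of first appearance.
        categories = list(dict.fromkeys(values))
        mapping = {cat: i for i, cat in enumerate(categories, start=1)}
        return [mapping[v] for v in values]
    # Ordinal: follow the ranking; anything not ranked shares the last rank.
    mapping = {cat: i for i, cat in enumerate(ranking, start=1)}
    last_rank = len(ranking) + 1
    return [mapping.get(v, last_rank) for v in values]


sizes = ["medium", "small", "large", "unknown", "small"]
nominal = encode_integer_sketch(sizes)
ordinal = encode_integer_sketch(sizes, ranking=["small", "medium", "large"])
```

Here 'unknown' is not in the ranking, so it receives the last rank, matching the behaviour described for values missing from the list.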

model_encode_one_hot(canonical: ~typing.Any, headers: [<class 'str'>, <class 'list'>], prefix=None, dtype: ~typing.Any = None, prefix_sep: str = None, dummy_na: bool = False, drop_first: bool = False, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame

Encodes categorical data types. One-hot encoding consists of encoding each categorical variable with different boolean variables (also called dummy variables) which take values 0 or 1, indicating if a category is present in an observation.

Parameters:
  • canonical – a pd.DataFrame as the reference dataframe

  • headers – the header(s) to apply the encoding

  • prefix – a str, list of str, or dict of str to append to the DataFrame column names; a list should have a length equal to the number of headers

  • prefix_sep – str separator, default ‘_’

  • dummy_na – add a column to indicate null values; if False, nulls are ignored

  • drop_first – Whether to get k-1 dummies out of k categorical levels by removing the first level.

  • dtype – Data type for new columns. Only a single dtype is allowed.

  • seed – (optional) a seed value for the random function; defaults to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if int, added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level: True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame
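
The parameter names here mirror those of pandas.get_dummies, so a sketch of the same encoding can be written with that function directly. Whether the library delegates to it internally is an assumption; the data is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "colour": ["red", "blue", "red"]})

# One dummy column per category, holding 0 or 1 for each observation.
encoded = pd.get_dummies(df, columns=["colour"], prefix="colour",
                         prefix_sep="_", dummy_na=False,
                         drop_first=False, dtype=int)
```

Setting drop_first=True would drop the first dummy column, leaving k-1 dummies for k categories.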

model_explode(canonical: ~typing.Any, header: str, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame

Takes a single column of list values and explodes the DataFrame so that each element in a row's list is represented by its own row.

Parameters:
  • canonical – a direct or generated pd.DataFrame. see context notes below

  • header – the header of the column to be exploded

  • seed – (optional) this is a placeholder, here for compatibility across methods

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if int, added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level: True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame
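
The effect of exploding a list column can be seen with the pandas DataFrame.explode method. This is a sketch of the behaviour described, with made-up data, and does not claim to be how the method is implemented.

```python
import pandas as pd

df = pd.DataFrame({"order": ["a", "b"], "items": [["pen", "ink"], ["paper"]]})

# Each list element gets its own row; the other columns are repeated.
exploded = df.explode("items").reset_index(drop=True)
```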

model_group(canonical: ~typing.Any, group_by: [<class 'str'>, <class 'list'>], headers: [<class 'str'>, <class 'list'>] = None, regex: bool = None, aggregator: str = None, list_choice: int = None, list_max: int = None, drop_group_by: bool = False, seed: int = None, include_weighting: bool = False, freq_precision: int = None, remove_weighting_zeros: bool = False, remove_aggregated: bool = False, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame

Groups the canonical by the group_by headers and aggregates the given headers. In addition to the standard groupby aggregators there are also ‘list’ and ‘set’, which return an aggregated list or set. These can be used in conjunction with ‘list_choice’ and ‘list_max’ to control the return values. If list_max is set to 1 then a single value is returned rather than a list of size 1.

Parameters:
  • canonical – a direct or generated pd.DataFrame. see context notes below

  • headers – the column headers to apply the aggregation to

  • group_by – the column headers to group by

  • regex – if the column headers are given as a regex

  • aggregator – (optional) the aggregator as a function of Pandas DataFrame ‘groupby’ or ‘list’ or ‘set’

  • list_choice – (optional) used in conjunction with list or set aggregator to return a random n choice

  • list_max – (optional) used in conjunction with list or set aggregator restricts the list to a n size

  • drop_group_by – (optional) drops the group by headers

  • include_weighting – (optional) include a percentage weighting column for each

  • freq_precision – (optional) a precision for the relative_freq values

  • remove_aggregated – (optional) if used in conjunction with the weighting then drops the aggregator column

  • remove_weighting_zeros – (optional) removes zero values

  • seed – (optional) this is a placeholder, here for compatibility across methods

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if int, added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level: True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame
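
The difference between a standard aggregator and a ‘list’ aggregator can be sketched with a pandas groupby. The slicing step stands in for the idea behind list_max and is an assumption about its effect, not the library's code.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["x", "x", "y", "y", "y"],
    "score": [1, 2, 3, 4, 5],
})

# A standard groupby aggregator.
totals = df.groupby("team")["score"].sum().reset_index()

# A 'list' aggregator: collect the grouped values into one list per group.
as_list = df.groupby("team")["score"].agg(list).reset_index()

# Restrict each list to at most two items, mirroring the idea behind list_max.
as_list["score"] = as_list["score"].apply(lambda items: items[:2])
```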

model_merge(canonical: ~typing.Any, other: ~typing.Any, left_on: str = None, right_on: str = None, on: str = None, how: str = None, headers: list = None, suffixes: tuple = None, indicator: bool = None, validate: str = None, replace_nulls: bool = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame

returns the full column values directly from another connector data source.

Parameters:
  • canonical – a direct or generated pd.DataFrame. see context notes below

  • other – a direct or generated pd.DataFrame. see context notes below

  • left_on – the canonical key column(s) to join on

  • right_on – the merging dataset key column(s) to join on

  • on – if the left and right join have the same header name this can replace left_on and right_on

  • how – (optional) One of ‘left’, ‘right’, ‘outer’, ‘inner’. Defaults to inner. See below for more detailed description of each method.

  • headers – (optional) a filter on the headers included from the right side

  • suffixes – (optional) a tuple of string suffixes to apply to overlapping columns. Defaults to (‘’, ‘_dup’).

  • indicator – (optional) Add a column to the output DataFrame called _merge with information on the source of each row. _merge is Categorical-type and takes on a value of left_only for observations whose merge key only appears in ‘left’ DataFrame or Series, right_only for observations whose merge key only appears in ‘right’ DataFrame or Series, and both if the observation’s merge key is found in both.

  • validate – (optional) a string, default None. If specified, checks if the merge is of the specified type. “one_to_one” or “1:1”: checks if merge keys are unique in both left and right datasets. “one_to_many” or “1:m”: checks if merge keys are unique in left dataset. “many_to_one” or “m:1”: checks if merge keys are unique in right dataset. “many_to_many” or “m:m”: allowed, but does not result in checks.

  • replace_nulls – (optional) replaces nulls with an appropriate value dependent upon the field type

  • seed – this is a placeholder, here for compatibility across methods

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if int, added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level: True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame

The other is a pd.DataFrame, a pd.Series or list, a connector contract str reference or a set of parameter instructions on how to generate a pd.DataFrame. The description of each is:

  • pd.DataFrame -> a deep copy of the pd.DataFrame

  • pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given

  • str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection

  • dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame
    methods:
    • model_*(…) -> one of the SyntheticBuilder model methods and parameters

    • @empty -> generates an empty pd.DataFrame where size and headers can be passed

      :size sets the index size of the dataframe
      :headers any initial headers for the dataframe

    • @generate -> generate a synthetic file from a remote Domain Contract

      :task_name the name of the SyntheticBuilder task to run
      :repo_uri the location of the Domain Product
      :size (optional) a size to generate
      :seed (optional) if a seed should be applied
      :run_book (optional) if specific intent should be run only
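
The merge parameters described for model_merge closely follow those of pandas.merge, so their effect can be sketched with that function. This is an illustration under the assumption that the semantics match; the frames and column names are made up.

```python
import pandas as pd

canonical = pd.DataFrame({"id": [1, 2, 3], "name": ["ann", "bob", "cat"]})
other = pd.DataFrame({"key": [2, 3, 4], "name": ["b", "c", "d"], "age": [30, 40, 50]})

# Left join on differently named keys; the overlapping 'name' column from the
# right side receives the '_dup' suffix, and '_merge' records each row's source.
merged = pd.merge(canonical, other, how="left", left_on="id", right_on="key",
                  suffixes=("", "_dup"), indicator=True, validate="one_to_one")
```

The validate="one_to_one" check raises an error if either key column contains duplicates, which makes accidental row multiplication visible early.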

model_missing_cca(canonical: ~typing.Any, threshold: float = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame

Applies Complete Case Analysis to the canonical. Complete-case analysis (CCA), also called “list-wise deletion” of cases, consists of discarding observations with any missing values. In other words, we only keep observations with data on all the variables. CCA works well when the data is missing completely at random.

Parameters:
  • canonical – a pd.DataFrame as the reference dataframe

  • threshold – (optional) a null threshold between 0 and 1 where 1 is all nulls. Defaults to 0.5

  • seed – (optional) a placeholder

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if int, added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level: True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame
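
A minimal sketch of list-wise deletion with pandas is shown below. How the threshold is applied is not spelled out above, so the sketch assumes one plausible reading: columns whose share of nulls reaches the threshold are set aside before rows with missing values are dropped.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, None, 4.0],
    "b": [None, None, None, 1.0],
    "c": ["w", "x", "y", "z"],
})

threshold = 0.5
# Assumption: columns whose share of nulls reaches the threshold are excluded first.
null_share = df.isnull().mean()
kept_columns = null_share[null_share < threshold].index.tolist()

# List-wise deletion: drop every row that still has a missing value.
complete_cases = df[kept_columns].dropna()
```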

model_modifier(canonical: ~typing.Any, other: ~typing.Any, targets_header: str = None, values_header: str = None, modifier: str = None, seed: int = None, precision: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame

Modifies a given set of target header names within the canonical with the target value for that name. The modifier indicates the type of modification to be performed. It is assumed the other DataFrame has the target headers as the first column and the target values as the second column; if this is not the case, the targets_header and values_header parameters can be used to specify the other header names.

Parameters:
  • canonical – a pd.DataFrame as the reference dataframe

  • other – a direct or generated pd.DataFrame. see context notes below

  • targets_header – (optional) the name of the target header where the header names are listed

  • values_header – (optional) The name of the value header where the target values are listed

  • modifier – (optional) how the value is to be modified. Options are ‘add’, ‘sub’, ‘mul’, ‘div’

  • precision – (optional) the value precision of the return values

  • seed – (optional) this is a placeholder, here for compatibility across methods

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if int, added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level: True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame

The other is a pd.DataFrame, a pd.Series or list, a connector contract str reference or a set of parameter instructions on how to generate a pd.DataFrame. The description of each is:

  • pd.DataFrame -> a deep copy of the pd.DataFrame

  • pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given

  • str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection

  • dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame
    methods:
    • model_*(…) -> one of the SyntheticBuilder model methods and parameters

    • @empty -> generates an empty pd.DataFrame where size and headers can be passed

      :size sets the index size of the dataframe
      :headers any initial headers for the dataframe

    • @generate -> generate a synthetic file from a remote Domain Contract

      :task_name the name of the SyntheticBuilder task to run
      :repo_uri the location of the Domain Product
      :size (optional) a size to generate
      :seed (optional) if a seed should be applied
      :run_book (optional) if specific intent should be run only
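
The layout that model_modifier expects of the other DataFrame, target header names in the first column and values in the second, can be sketched as follows. The helper apply_modifier_sketch and the column names are hypothetical and only illustrate the idea.

```python
import operator

import pandas as pd

canonical = pd.DataFrame({"price": [10.0, 20.0], "qty": [1.0, 2.0]})
# First column: target header names; second column: the value to apply to each.
other = pd.DataFrame({"target": ["price", "qty"], "value": [0.5, 3.0]})

modifiers = {"add": operator.add, "sub": operator.sub,
             "mul": operator.mul, "div": operator.truediv}


def apply_modifier_sketch(df, other, modifier="add", precision=None):
    """Hypothetical helper: modify each named column by its paired value."""
    result = df.copy()
    targets, values = other.iloc[:, 0], other.iloc[:, 1]
    for header, value in zip(targets, values):
        result[header] = modifiers[modifier](result[header], value)
    return result if precision is None else result.round(precision)


added = apply_modifier_sketch(canonical, other, modifier="add")
scaled = apply_modifier_sketch(canonical, other, modifier="mul", precision=2)
```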

model_noise(canonical: ~typing.Any, num_columns: int, inc_targets: bool = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame

Generates multiple columns of noise in your dataset

Parameters:
  • canonical – a direct or generated pd.DataFrame. see context notes below

  • num_columns – the number of columns of noise

  • inc_targets – (optional) if a predictor target should be included. default is false

  • seed – (optional) a seed value for the random function; defaults to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if int, added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level: True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a DataFrame

The canonical is a pd.DataFrame, a pd.Series or list, a connector contract str reference or a set of parameter instructions on how to generate a pd.DataFrame. The description of each is:

  • pd.DataFrame -> a deep copy of the pd.DataFrame

  • pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given

  • str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection

  • dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame
    methods:
    • model_*(…) -> one of the SyntheticBuilder model methods and parameters

    • @empty -> generates an empty pd.DataFrame where size and headers can be passed

      :size sets the index size of the dataframe
      :headers any initial headers for the dataframe

    • @generate -> generate a synthetic file from a remote Domain Contract

      :task_name the name of the SyntheticBuilder task to run
      :repo_uri the location of the Domain Product
      :size (optional) a size to generate
      :seed (optional) if a seed should be applied
      :run_book (optional) if specific intent should be run only
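
Adding seeded noise columns to a canonical, as model_noise is described as doing, might look like the sketch below. The distribution and the column naming used by model_noise are not stated above, so standard-normal values and the ‘noise_<n>’ names are assumptions.

```python
import numpy as np
import pandas as pd

canonical = pd.DataFrame({"id": range(5)})
num_columns = 3

rng = np.random.default_rng(seed=42)
# Assumption: standard-normal noise and 'noise_<n>' names are illustrative choices.
noise = pd.DataFrame(
    rng.standard_normal(size=(len(canonical), num_columns)),
    columns=[f"noise_{i}" for i in range(num_columns)],
    index=canonical.index,
)
with_noise = pd.concat([canonical, noise], axis=1)
```

Fixing the seed makes the generated columns reproducible between runs.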

model_profiling(canonical: ~typing.Any, profiling: str, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, connector_name: str = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs)

Data profiling is the process of examining, analyzing, and creating useful summaries of data. The process yields a high-level overview which aids in the discovery of data quality issues, risks, and overall trends. It can be used to identify any errors, anomalies, or patterns that may exist within the data. There are three types of data profiling available: ‘dictionary’, ‘schema’ or ‘quality’.

If the connector_name is used, it outputs the results to this connector contract and returns the original canonical. This allows a canonical pipeline to continue through the component while outputting the data profile to an alternative path.

Parameters:
  • canonical – a direct or generated pd.DataFrame. see context notes below

  • profiling – The profiling name. Options are ‘dictionary’, ‘schema’ or ‘quality’

  • headers – (optional) a filter of headers from the ‘other’ dataset

  • drop – (optional) to drop or not drop the headers if specified

  • dtype – (optional) a filter on data type for the ‘other’ dataset. int, float, bool, object

  • exclude – (optional) to exclude or include the data types if specified

  • regex – (optional) a regular expression to search the headers. For example, ‘^((?!_amt).)*$’ excludes headers containing ‘_amt’

  • re_ignore_case – (optional) true if the regex should ignore case. Default is False

  • connector_name – (optional) a connector name where the outcome is sent

  • seed – (optional) this is a placeholder, here for compatibility across methods

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if int, added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level: True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

  • kwargs – if using connector_name, any kwargs to pass to the handler

Returns:

a pd.DataFrame

The canonical is a pd.DataFrame, a pd.Series or list, a connector contract str reference or a set of parameter instructions on how to generate a pd.DataFrame. The description of each is:

  • pd.DataFrame -> a deep copy of the pd.DataFrame

  • pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given

  • str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection

  • dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame
    methods:
    • model_*(…) -> one of the SyntheticBuilder model methods and parameters

    • @empty -> generates an empty pd.DataFrame where size and headers can be passed

      :size sets the index size of the dataframe
      :headers any initial headers for the dataframe

    • @generate -> generate a synthetic file from a remote Domain Contract

      :task_name the name of the SyntheticBuilder task to run
      :repo_uri the location of the Domain Product
      :size (optional) a size to generate
      :seed (optional) if a seed should be applied
      :run_book (optional) if specific intent should be run only
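
A header filter of the kind described for the regex and re_ignore_case parameters can be tried out with Python's re module. The header names are made up, and how the library applies the pattern internally is not shown here.

```python
import re

headers = ["order_id", "net_amt", "tax_amt", "region"]

# A negative lookahead at every position rejects any header containing '_amt'.
pattern = re.compile(r"^((?!_amt).)*$")
kept = [h for h in headers if pattern.search(h)]

# With re.IGNORECASE the same pattern also rejects 'NET_AMT'.
pattern_ci = re.compile(r"^((?!_amt).)*$", re.IGNORECASE)
kept_ci = [h for h in ["NET_AMT", "region"] if pattern_ci.search(h)]
```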

model_sample(canonical: ~typing.Any, other: ~typing.Any, headers: list, replace: bool = None, relative_freq: list = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame

Takes a target dataset and samples from that target to the size of the canonical

Parameters:
  • canonical – a pd.DataFrame as the reference dataframe

  • other – a direct or generated pd.DataFrame. see context notes below

  • headers – the headers to be selected from the other DataFrame

  • replace – assuming other is bigger than canonical, selects without replacement when True

  • relative_freq – (optional) a weighting pattern that does not have to add to 1

  • seed – (optional) a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if int, added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level: True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a DataFrame
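
Sampling another dataset up to the size of the canonical can be sketched with pandas DataFrame.sample. Treating relative_freq as one weight per row of the other dataset is an assumption made for the example; the data is made up.

```python
import pandas as pd

canonical = pd.DataFrame({"id": range(6)})
other = pd.DataFrame({"city": ["Leeds", "York", "Hull"], "code": ["LS", "YO", "HU"]})

# Relative weights per row of 'other'; pandas normalises them, so they need not sum to 1.
relative_freq = [5, 3, 2]

# 'other' is smaller than the canonical, so rows must be drawn with replacement.
sampled = other[["city"]].sample(n=len(canonical), replace=True,
                                 weights=relative_freq, random_state=0)
sampled = sampled.reset_index(drop=True)
result = pd.concat([canonical, sampled], axis=1)
```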

model_sample_map(canonical: ~typing.Any, sample_map: str, selection: list = None, headers: [<class 'str'>, <class 'list'>] = None, shuffle: bool = None, rename_columns: dict = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs) DataFrame

Builds a model of a Sample Mapped distribution. To see the sample maps available use the MappedSample class __dir__() method:

> from ds_discovery.sample.sample_data import MappedSample
> MappedSample().__dir__()

Parameters:
  • canonical – a direct or generated pd.DataFrame. see context notes below

  • sample_map – the sample map name. use MappedSample().__dir__() to get a list of available samples

  • rename_columns – (optional) rename the columns ‘City’, ‘Zipcode’, ‘State’

  • selection – (optional) a list of selections where conditions are filtered on, executed in list order. An example of a selection with the minimum requirements is (see ‘select2dict(…)’): [{‘column’: ‘state’, ‘condition’: “isin([‘NY’, ‘TX’])”}]

  • headers – a header or list of headers to filter on

  • shuffle – (optional) if the selection should be shuffled before selection. Default is true

  • seed – (optional) a seed value for the random function; defaults to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if int, added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level: True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

  • kwargs – any additional parameters to pass to the sample map

Returns:

a DataFrame

The canonical is a pd.DataFrame, a pd.Series or list, a connector contract str reference or a set of parameter instructions on how to generate a pd.DataFrame. The description of each is:

  • pd.DataFrame -> a deep copy of the pd.DataFrame

  • pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given

  • str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection

  • dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame
    methods:
    • model_*(…) -> one of the SyntheticBuilder model methods and parameters

    • @empty -> generates an empty pd.DataFrame where size and headers can be passed

      :size sets the index size of the dataframe
      :headers any initial headers for the dataframe

    • @generate -> generate a synthetic file from a remote Domain Contract

      :task_name the name of the SyntheticBuilder task to run
      :repo_uri the location of the Domain Product
      :size (optional) a size to generate
      :seed (optional) if a seed should be applied
      :run_book (optional) if specific intent should be run only
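
The selection format used by model_sample_map, a list of dicts with ‘column’ and ‘condition’ keys, can be sketched against a small pandas DataFrame. Treating each condition string as a method call on the named column is an assumption about how selections are evaluated; the sample data is made up.

```python
import pandas as pd

sample = pd.DataFrame({
    "city": ["Albany", "Austin", "Boston", "Dallas"],
    "state": ["NY", "TX", "MA", "TX"],
})

selection = [{"column": "state", "condition": "isin(['NY', 'TX'])"}]

# Assumption: each condition string is treated as a method call on the named column.
# eval() is used for brevity only and should never be given untrusted strings.
mask = pd.Series(True, index=sample.index)
for item in selection:
    column = sample[item["column"]]
    mask &= eval(f"column.{item['condition']}", {}, {"column": column})

selected = sample[mask].reset_index(drop=True)
```

Because the selections are applied in list order and combined with a logical AND, each extra entry can only narrow the result.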

model_synthetic_data_types(canonical: int = None, extended: bool = False, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame

A dataset with example data types

Parameters:
  • canonical – the canonical size (rows) of the sample dataset

  • extended – if the types should extend beyond the standard 6 types to include nulls, predominance, etc.

  • seed – a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run.
    • if None: defaults to -1

    • if -1: added to a level above any current instance of the intent section, level 0 if not found

    • if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level
    • True: replaces the current intent method with the new

    • False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any identical duplicate intent at any level

Returns:

a pd.DataFrame

model_synthetic_personal_identity(canonical: int = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame

A dataset with Personally Identifiable Information

Parameters:
  • canonical – the canonical size (rows) of the sample dataset

  • seed – a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run.
    • if None: defaults to -1

    • if -1: added to a level above any current instance of the intent section, level 0 if not found

    • if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level
    • True: replaces the current intent method with the new

    • False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any identical duplicate intent at any level

Returns:

a pd.DataFrame
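
The intent_order and replace_intent rules that recur in these parameter lists can be sketched in plain Python. This is one reading of the documented rules, not the library's implementation: the function place_intent and the flat mapping of order number to stored parameters are hypothetical, and the way replace_intent interacts with an explicit integer order is an assumption.

```python
from typing import Any, Dict, Optional


def place_intent(
    levels: Dict[int, Dict[str, Any]],
    params: Dict[str, Any],
    intent_order: Optional[int] = None,
    replace_intent: bool = True,
) -> Dict[int, Dict[str, Any]]:
    """Return a copy of `levels` with `params` placed at the resolved order.

    `levels` maps an order number to the stored parameters of ONE intent
    method; this flat layout is an assumption made for the sketch.
    """
    result = dict(levels)
    order = -1 if intent_order is None else intent_order
    if order == -1:
        # one above the highest existing order, or 0 if none exist
        order = max(result) + 1 if result else 0
    if order in result and not replace_intent:
        # leave the existing intent untouched and disregard the new one
        return result
    result[order] = params
    return result


stored = place_intent({}, {"size": 10})                      # lands at 0
stored = place_intent(stored, {"size": 20})                  # lands at 1
stored = place_intent(stored, {"size": 30}, intent_order=0)  # overwrites 0
kept = place_intent(stored, {"size": 40}, intent_order=1, replace_intent=False)

print(stored)
print(kept)
```

With the default order of -1 each call stacks above the previous one, while an explicit integer targets a slot directly and replace_intent decides whether an occupied slot is overwritten.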

model_to_category(canonical: ~typing.Any, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

converts columns to categories

Parameters:
  • canonical – a pd.DataFrame as the reference dataframe

  • headers – a list of headers to drop or filter on type

  • drop – to drop or not drop the headers

  • dtype – the column types to include or exclude. Default is None, otherwise int, float, bool, object or ‘number’

  • exclude – to exclude or include the dtypes

  • regex – a regular expression to search the headers

  • re_ignore_case – true if the regex should ignore case. Default is False

  • seed – (optional) a placeholder

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run.
    • if None: defaults to -1

    • if -1: added to a level above any current instance of the intent section, level 0 if not found

    • if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level
    • True: replaces the current intent method with the new

    • False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any identical duplicate intent at any level

Returns:

a pd.DataFrame

model_to_numeric(canonical: ~typing.Any, headers: [<class 'str'>, <class 'list'>] = None, drop: bool = None, dtype: [<class 'str'>, <class 'list'>] = None, exclude: bool = None, regex: [<class 'str'>, <class 'list'>] = None, re_ignore_case: bool = None, precision: int = None, seed: int = None, save_intent: bool = None, column_name: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

converts columns to numeric values

Parameters:
  • canonical – a pd.DataFrame as the reference dataframe

  • headers – a list of headers to drop or filter on type

  • drop – to drop or not drop the headers

  • dtype – the column types to include or exclude. Default is None, otherwise int, float, bool, object or ‘number’

  • exclude – to exclude or include the dtypes

  • regex – a regular expression to search the headers

  • re_ignore_case – true if the regex should ignore case. Default is False

  • precision – (optional) an int value of the precision for the float

  • seed – (optional) a placeholder

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • column_name – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run.
    • if None: defaults to -1

    • if -1: added to a level above any current instance of the intent section, level 0 if not found

    • if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level
    • True: replaces the current intent method with the new

    • False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any identical duplicate intent at any level

Returns:

a pd.DataFrame
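
The headers, drop, regex and re_ignore_case parameters shared by model_to_category and model_to_numeric select which columns are converted. The sketch below shows one plausible way such a selection could work on a plain list of column names. The function select_headers is hypothetical, the rule that regex matches are added to the header selection is an assumption, and the dtype and exclude filters are left out because they need a real DataFrame.

```python
import re
from typing import List, Optional, Union

StrOrList = Optional[Union[str, List[str]]]


def _as_list(value: StrOrList) -> List[str]:
    """Normalise None, a single string or a list into a list."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def select_headers(
    columns: List[str],
    headers: StrOrList = None,
    drop: bool = False,
    regex: StrOrList = None,
    re_ignore_case: bool = False,
) -> List[str]:
    """Pick column names by explicit list and/or regular expression."""
    header_list, patterns = _as_list(headers), _as_list(regex)
    if not header_list and not patterns:
        return list(columns)  # no filter given: keep everything
    selected = set()
    if header_list:
        # drop=False keeps the named headers, drop=True keeps the rest
        selected = {c for c in columns if (c not in header_list) == drop}
    flags = re.IGNORECASE if re_ignore_case else 0
    for pattern in patterns:
        # Assumption: regex matches are added to the selection
        selected |= {c for c in columns if re.search(pattern, c, flags)}
    return [c for c in columns if c in selected]  # preserve column order


cols = ["age", "Age_band", "income", "city"]
print(select_headers(cols, headers=["age", "income"]))
print(select_headers(cols, headers="city", drop=True))
print(select_headers(cols, regex="^age", re_ignore_case=True))
```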

run_intent_pipeline(canonical: ~typing.Any = None, intent_levels: [<class 'str'>, <class 'int'>, <class 'list'>] = None, run_book: str = None, seed: int = None, simulate: bool = None, **kwargs) DataFrame

Collectively runs all parameterised intent taken from the property manager against the code base as defined by the intent_contract. The whole run can be seeded, though any parameterised seeding in the intent contracts will take precedence.

Parameters:
  • canonical – a direct or generated pd.DataFrame. see context notes below

  • intent_levels – (optional) a single intent_level or a list of intent_levels to run in the order given

  • run_book – (optional) a preset runbook of intent_level to run in order

  • seed – (optional) a seed value that will be applied across the run: default to None

  • simulate – (optional) returns a report of the run order and the indexed column order of the run

Returns:

a pandas dataframe
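
The level ordering and seed precedence described for run_intent_pipeline can be sketched without the library. The function plan_run and the nested dictionary layout (level, then method name, then parameters) are hypothetical and chosen only for illustration; running all levels in stored order when intent_levels is None is also an assumption. The sketch only plans the run and executes nothing.

```python
from typing import Any, Dict, List, Optional, Tuple, Union

Level = Union[str, int]
Contract = Dict[Level, Dict[str, Dict[str, Any]]]


def plan_run(
    contract: Contract,
    intent_levels: Optional[Union[Level, List[Level]]] = None,
    seed: Optional[int] = None,
) -> List[Tuple[Level, str, Optional[int]]]:
    """List (level, method, effective seed) in the order they would run."""
    if intent_levels is None:
        levels = list(contract)  # assumption: all levels, in stored order
    elif isinstance(intent_levels, (str, int)):
        levels = [intent_levels]
    else:
        levels = list(intent_levels)
    plan = []
    for level in levels:
        for method, params in contract.get(level, {}).items():
            own_seed = params.get("seed")
            # a seed stored with the intent takes precedence over the run seed
            plan.append((level, method, seed if own_seed is None else own_seed))
    return plan


contract = {
    "clean": {
        "model_to_numeric": {"precision": 2, "seed": 7},
        "model_to_category": {},
    },
    "extra": {"model_synthetic_data_types": {"extended": True}},
}

for step in plan_run(contract, intent_levels=["extra", "clean"], seed=31):
    print(step)
```

Here the run seed of 31 applies to every intent except model_to_numeric, which keeps its own stored seed of 7.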

property sample_lists: list

A list of sample options

property sample_maps: list

A list of sample options

static select2dict(column: str, condition: str, expect: str | None = None, logic: str | None = None, date_format: str | None = None, offset: int | None = None) dict

a utility method to help build feature conditions by aligning method parameters with dictionary format.

Parameters:
  • column – the column name to apply the condition to

  • condition – the condition string (special conditions are ‘date.now’ for current date)

  • expect – (optional) the data type to expect. If None then the data type is assumed from the dtype

  • logic – (optional) the logic to provide, see below for options

  • date_format – (optional) a format of the date if only a specific part of the date and time is required

  • offset – (optional) a time delta in days (+/-) from the current date and time (minutes not supported)

Returns:

dictionary of the parameters

logic:

AND: the intersect of the current state with the condition result (common to both)
NAND: outside the intersect of the current state with the condition result (not common to both)
OR: the union of the current state with the condition result (everything in both)
NOR: outside the union of the current state with the condition result (everything not in both)
NOT: the difference between the current state and the condition result
XOR: the difference between the union and the intersect of the current state with the condition result

extra logic:

ALL: the intersect of the whole index with the condition result, irrespective of level or current state index
ANY: the intersect of the level index with the condition result, irrespective of current state index
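
The logic options above map naturally onto set operations over row indices. The sketch below is an interpretation using Python sets, not the library's code. The function apply_logic is hypothetical, and it assumes that "outside" for NAND and NOR means relative to the whole index, and that the level index defaults to the whole index when not supplied.

```python
from typing import Optional, Set


def apply_logic(
    current: Set[int],
    condition: Set[int],
    whole: Set[int],
    logic: str = "AND",
    level: Optional[Set[int]] = None,
) -> Set[int]:
    """Combine the current state index with a condition result."""
    level = whole if level is None else level  # assumption: default level
    results = {
        "AND": current & condition,
        "NAND": whole - (current & condition),
        "OR": current | condition,
        "NOR": whole - (current | condition),
        "NOT": current - condition,
        "XOR": (current | condition) - (current & condition),
        "ALL": whole & condition,
        "ANY": level & condition,
    }
    key = logic.upper()
    if key not in results:
        raise ValueError(f"unknown logic '{logic}'")
    return results[key]


whole = set(range(10))
current = {0, 1, 2, 3, 4}
condition = {3, 4, 5, 6}

for name in ("AND", "NAND", "OR", "NOR", "NOT", "XOR", "ALL"):
    print(name, sorted(apply_logic(current, condition, whole, name)))
```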