Reports: Adding Information of Interest

Adding extended information of interest to guide and inform about appropriate attribution and decision making by the creator or editors to promote information share and assist data reuse.

from ds_discovery import Transition, Wrangle

Adding Metadata

During the process of development multiple experts add value to our understanding of the dataset. Project Hadron captures this knowledge as part of its metadata and provides easy access tools to retain this knowledge at real or near real time as well as adding it retrospectively through automated processes.

Knowledge capture is placed under a tree structure of: - catalogue: provides an encompassing group identifier such as attributes or observations. - label: a subset of categories identifying the individual set of text such as attribute name or observation type. - text: a brief or descriptive narrative of the catalogue and label. Text is immutable thus new text with the same catalogue and label will be added to the existing content.

tr = Transition.from_env('demo_metadata', has_contract=False)

Set File Source

Initially we set the file source for our data of interest and run the component.

## Set the file source location
tr.set_source_uri('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv', template_aligned=False)
tr.set_persist()
tr.set_description("Titanic Dataset used by Seaborn")

Adding Attributes

A vital part of understanding one’s dataset is to describe the attributes provided. In this instance we name our catalogue group ‘attributes’. The attributes are labeled with the name of the attribute and given a description.

## Add some attribute descriptions
tr.add_notes(catalog='attributes', label='age', text='The age of the passenger has limited null values')
tr.add_notes(catalog='attributes', label='deck', text='cabin has already been split into deck from the originals')
tr.add_notes(catalog='attributes', label='fare', text='the price of the fair')
tr.add_notes(catalog='attributes', label='pclass', text='The class of the passenger')
tr.add_notes(catalog='attributes', label='sex', text='The gender of the passenger')
tr.add_notes(catalog='attributes', label='survived', text='If the passenger survived or not as the target')
tr.add_notes(catalog='attributes', label='embarked', text='The code for the port the passengered embarked')

Adding Observations

In addition we can capture feedback from an SME or data owner, for example. In this case we capture ‘observations’ as our catalogue and ‘describe’ as our label which we maintain for both descriptions.

One can now use the reporting tool to visually present the knowledge added. It is worth noting that with observations each description has been captured.

tr.add_notes(catalog='observations', label='describe',
             text='The original Titanic dataset has been engineered to fit Seaborn functionality')
tr.add_notes(catalog='observations', label='describe',
             text='The age and deck attributes still maintain their null values')

tr.report_notes(drop_dates=True)

Bulk Notes

In addition to adding individual notes one also has the ability to upload bulk notes from an external data source. In our next example we take an order book and from an already existing description catalogue extract that knowledge and add it to our attributes.

tr = Transition.from_env('cs_orders', has_contract=False)

Set File Source

Initially set the file source for the data of interest and run the component.

tr.set_source_uri(uri='data/CS_ORDERS.txt', sep='\t', error_bad_lines=False, low_memory=True, encoding='Latin1')
tr.set_persist()
tr.set_description("Consumer Notebook Orders for Q4 FY20")

Connect the Bulk Upload

First create a connector to the information source.

tr.add_connector_uri(connector_name='bulk_notes', uri='data/cs_orders_dictionary.csv')

Upload the Descriptions

With our connector in place one can now load that data and specify the columns of interest that provide both the label and the text.

Using our reporting tool one can now observe that attribute descriptions have been uploaded.

notes = tr.load_canonical(connector_name='bulk_notes')
tr.upload_notes(canonical=notes, catalog='attributes', label_key='Attribute', text_key='Description')

tr.report_notes(drop_dates=True)

../../../_images/met_img02.png — not all attributes are displayed

Report Filtering

Sometimes bulk uploads can result in a large amount of added information. Our reporting tool has the ability to filter what we visualize giving us a clean summery of items of interest. In our example we are filtering on ‘label’ across all sections, or catalogues.

tr.report_notes(labels=['ORD_DTS', 'INV_DTS', 'HOLD_DTS'], drop_dates=True)