Tutorial

This tutorial will show you how to use the Table-Cleaner validation framework.

First, let’s import the necessary modules. My personal style is to abbreviate the scientific Python libraries with two letters: this avoids namespace clutter while still being reasonably short.

import numpy as np
import pandas as pd
from IPython import display
import table_cleaner as tc

IPython.display provides the means to display Python objects in a “rich” way, which is especially useful for tables.

Introduction

Validating tabular data, especially from CSV or Excel files, is a very common task in data science and even in generic programming. Often this data isn’t “clean” enough for further processing, and writing custom code to transform or clean it up quickly gets out of hand.

Table-Cleaner is a framework to generalize this cleaning process.

Basic Example

First, let’s create a DataFrame with messy data.

initial_df = pd.DataFrame(dict(
    name=["Alice", "Bob", "Wilhelm Alexander", 1, "Mary", "Andy"],
    email=["alice@example.com", "bob@example.com", "blub", 4,
           "mary@example.com", "andy k@example .com"],
    x=[0, 3.2, "5", "hello", -3, 11],
    y=[0.2, 3.2, 1.3, "hello", -3.0, 11.0],
    active=["Y", None, "T", "false", "no", "T"],
))

display.display(initial_df)
active email name x y
0 Y alice@example.com Alice 0 0.2
1 None bob@example.com Bob 3.2 3.2
2 T blub Wilhelm Alexander 5 1.3
3 false 4 1 hello hello
4 no mary@example.com Mary -3 -3
5 T andy k@example .com Andy 11 11

This dataframe contains several columns, and some cells don’t look much like the other cells in the same column. For example, there are numbers in the email and name columns and strings in the number columns.

Looking at the dtypes assigned to the dataframe columns reveals a further issue with this mess:

initial_df.dtypes
active    object
email     object
name      object
x         object
y         object
dtype: object

All columns are reported as “object”, which means the values are stored as individual Python objects rather than as fixed-size integers or floats. This can make further processing inefficient, but also error prone, because arbitrary Python objects may not work with certain dataframe functionality.

Let’s define a cleaner:

class MyCleaner(tc.Cleaner):
    name = tc.String(min_length=2, max_length=10)
    email = tc.Email()
    x = tc.Int(min_value=0, max_value=10)
    y = tc.Float64(min_value=0, max_value=10)
    active = tc.Bool()

Cleaner classes contain fields with validators.

The tc.String validator converts every input to a string. Because most Python objects have some way of being represented as a string, this works more often than not; pretty much the only way it can fail is a wrongly encoded string. Additionally, it can impose restrictions on minimum and maximum string length.

The tc.Int instance tries to turn the input into an integer. This usually only works with numbers, or with strings that look like integers. Here, too, minimum and maximum values can optionally be specified.
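To see these two validators in isolation, here is a minimal sketch (the class name and column values are made up for illustration):

class TinyCleaner(tc.Cleaner):
    name = tc.String(min_length=2, max_length=10)
    x = tc.Int(min_value=0, max_value=10)

tiny_df = pd.DataFrame(dict(name=["Bo", 3.14, "A"], x=["5", 7, "hello"]))
tiny_cleaner = TinyCleaner(tiny_df)
# 3.14 is happily stringified to "3.14", while "A" fails the min_length check.
# "5" is parsed as the integer 5, while "hello" cannot be converted and is rejected.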

The cleaner object can now validate the input dataframe like this:

cleaner = MyCleaner(initial_df)

Instantiating the cleaner class with an input dataframe creates a cleaner instance with data and verdicts. The validation happens inside the constructor.

cleaner.cleaned
active email name x y
0 True alice@example.com Alice 0 0.2
1 NaN bob@example.com Bob 3 3.2

cleaner.cleaned.dtypes
active     object
email      object
name       object
x           int64
y         float64
dtype: object

The DataFrame only contains completely valid rows, because the default behavior is to delete any rows containing an error. See below on how to use missing values instead.

The dtype of the “x” column is now int64 instead of object, and “y” is now float64. Pandas uses the dtype system specified by numpy, and numpy reports strings as “object”. The main reason is that numeric data is usually stored contiguously, with every value occupying the same number of bytes in memory. Strings, not so much: their size varies, so an array of strings has to reference each string object through a pointer, and it is the array of pointers that is contiguous, with a fixed number of bytes per pointer.
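You can observe the cost of that indirection directly; this is plain pandas, nothing Table-Cleaner-specific:

s_int = pd.Series(range(1000))
s_obj = s_int.astype(object)
# The object version stores a pointer per element plus a full Python object,
# so it needs considerably more memory than the contiguous int64 version.
s_int.memory_usage(deep=True), s_obj.memory_usage(deep=True)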

The “active” column is validated as a boolean field. There is a dtype called bool, but it only allows True and False. If there are missing values, the column reverts to “object”. To force the bool dtype, read the section about booleans below.

So far, we have ensured only valid data is in the output table. But Table Cleaner can do more: The errors themselves can be treated as data:

cleaner.verdicts
column counter description reason valid
0 active 0 undefined verdict undefined True
0 name 1 undefined verdict undefined True
0 email 2 undefined verdict undefined True
0 x 3 undefined verdict undefined True
0 y 4 undefined verdict undefined True
1 active 5 undefined verdict undefined True
1 name 6 undefined verdict undefined True
1 email 7 undefined verdict undefined True
1 x 8 undefined verdict undefined True
1 y 9 undefined verdict undefined True
2 active 10 undefined verdict undefined True
2 name 11 'Wilhelm Alexander' has more than 10 characters too long False
2 email 12 E-Mail addresses must contain one @ character. email_without_at False
2 x 13 undefined verdict undefined True
2 y 14 undefined verdict undefined True
3 active 15 undefined verdict undefined True
3 name 16 '1' has fewer than 2 characters too short False
3 email 17 E-Mail addresses must contain one @ character. email_without_at False
3 x 18 'hello' cannot be converted to int32 invalid int32 False
3 y 19 'hello' cannot be converted to float64 invalid float64 False
4 active 20 undefined verdict undefined True
4 name 21 undefined verdict undefined True
4 email 22 undefined verdict undefined True
4 x 23 -3 is lower than 0 value too low False
4 y 24 -3 is lower than 0 value too low False
5 active 25 undefined verdict undefined True
5 name 26 undefined verdict undefined True
5 email 27 'example .com' is not a valid email domain name email_domain_name_invalid False
5 email 28 'andy k' is not a valid email user name email_user_name_invalid False
5 x 29 11 is higher than 10 value too high False
5 y 30 11 is higher than 10 value too high False

In most cases there is exactly one verdict per cell, i.e. one per row and column. The exception here is the last row, which has two errors for the email column. With the current set of built-in validators this happens rarely, but keep it in mind: don’t naively sum up the errors and call the result the “number of invalid data points”.

Let’s filter the verdicts by validity:

errors = cleaner.verdicts[~cleaner.verdicts.valid]
display.display(errors)
column counter description reason valid
2 name 11 'Wilhelm Alexander' has more than 10 characters too long False
2 email 12 E-Mail addresses must contain one @ character. email_without_at False
3 name 16 '1' has fewer than 2 characters too short False
3 email 17 E-Mail addresses must contain one @ character. email_without_at False
3 x 18 'hello' cannot be converted to int32 invalid int32 False
3 y 19 'hello' cannot be converted to float64 invalid float64 False
4 x 23 -3 is lower than 0 value too low False
4 y 24 -3 is lower than 0 value too low False
5 email 27 'example .com' is not a valid email domain name email_domain_name_invalid False
5 email 28 'andy k' is not a valid email user name email_user_name_invalid False
5 x 29 11 is higher than 10 value too high False
5 y 30 11 is higher than 10 value too high False
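Because a single cell can produce more than one verdict, count distinct row indices rather than verdicts when you want to know how many rows are affected:

len(errors)                 # 12 error verdicts...
len(errors.index.unique())  # ...but only 4 distinct input rows are affected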

As this is an ordinary DataFrame, we can apply all the usual shenanigans to it, for example:

errors.groupby(["column", "reason"])[["counter"]].count()
counter
column reason
email email_domain_name_invalid 1
email_user_name_invalid 1
email_without_at 2
name too long 1
too short 1
x invalid int32 1
value too high 1
value too low 1
y invalid float64 1
value too high 1
value too low 1

This functionality is the main reason Table Cleaner was written in the first place. In reproducible data science, it is important not only to validate input data, but also to be aware of, analyze and present the errors it contains.

Markup Frames

Let’s bring some color into our tables. First, define some CSS styles for the notebook, like so:

%%html
<style>
.tc-cell-invalid {
    background-color: #ff8080
}
.tc-highlight {
    color: red;
    font-weight: bold;
    border: 3px solid black;
    background-color: #b0b0b0;
}

.tc-green {
    background-color: #80ff80
}
.tc-blue {
    background-color: #8080ff;
}
</style>

The MarkupFrame class is subclassed from Pandas’ DataFrame class and is used to manipulate and render cell-specific markup. It behaves almost exactly the same as a DataFrame.

Caution: This functionality will soon be completely rewritten to have a simpler and cleaner API.

It can be created from a validation like this:

mdf = tc.MarkupFrame.from_validation(initial_df, cleaner.verdicts)
mdf
active email name x y
0 Y alice@example.com Alice 0 0.2
1 None bob@example.com Bob 3.2 3.2
2 T blub Wilhelm Alexander 5 1.3
3 false 4 1 hello hello
4 no mary@example.com Mary -3 -3.0
5 T andy k@example .com Andy 11 11.0

Note that we pass in the initial_df table, because the verdicts always relate to the original dataframe, not to the output, which may have been altered and shortened during validation.
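You can verify this directly: with the default settings, the cleaned frame has fewer rows than the original.

len(initial_df), len(cleaner.cleaned)
(6, 2)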

Now watch this:

mdf.x[1] += "tc-highlight"
mdf.y += "tc-green"
mdf.loc[0, :] += "tc-blue"
mdf
active email name x y
0 Y alice@example.com Alice 0 0.2
1 None bob@example.com Bob 3.2 3.2
2 T blub Wilhelm Alexander 5 1.3
3 false 4 1 hello hello
4 no mary@example.com Mary -3 -3.0
5 T andy k@example .com Andy 11 11.0

Booleans

The trouble with Booleans

Boolean values are either True or False. In Pandas, and in data science in general, things are a bit trickier: there is a third state, which Pandas treats as a missing value. Numpy’s bool dtype, however, does not support missing values.

np.bool(None)
False

What’s happening here is that many Python objects have a way of being interpreted as either True or False: an empty list, an empty string, and None are all considered false, for example.
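A quick illustration of these truthiness rules:

bool([]), bool(""), bool(None)
(False, False, False)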

Now, let’s try that in Pandas:

bools = pd.Series([True, False, None, np.NaN])
bools
0     True
1    False
2     None
3      NaN
dtype: object

The dtype is not “bool”. Instead, Pandas stores references to the individual Python objects, so the dtype must be “object”. We can make it bool, though:

bools.astype(bool)
0     True
1    False
2    False
3     True
dtype: bool

Notice how np.NaN, which is normally interpreted as a missing value, has been converted to True?
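The reason is that np.NaN is a regular, non-zero float, and non-zero numbers are truthy in Python:

bool(np.NaN)
True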

If you try to index something with this sequence, this is what happens:

original = pd.Series(range(3))
original[bools]
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-17-3d0782e602c7> in <module>()
      1 original = pd.Series(range(3))
----> 2 original[bools]


/usr/local/lib/python3.4/site-packages/pandas/core/series.py in __getitem__(self, key)
    544             key = list(key)
    545
--> 546         if _is_bool_indexer(key):
    547             key = _check_bool_indexer(self.index, key)
    548


/usr/local/lib/python3.4/site-packages/pandas/core/common.py in _is_bool_indexer(key)
   2058             if not lib.is_bool_array(key):
   2059                 if isnull(key).any():
-> 2060                     raise ValueError('cannot index with vector containing '
   2061                                      'NA / NaN values')
   2062                 return False


ValueError: cannot index with vector containing NA / NaN values
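Without Table Cleaner, the usual way out is to decide explicitly what a missing value should mean and fill it in before converting:

bools.fillna(False).astype(bool)
0     True
1    False
2    False
3    False
dtype: bool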

Let’s take a look at how to bring some sanity into this issue with Table Cleaner. First, define a messy DataFrame, with columns that are identical:

bools = [True, False, None, np.NaN]

bool_df = pd.DataFrame(dict(a=bools, b=bools, c=bools, d=bools))
bool_df
a b c d
0 True True True True
1 False False False False
2 None None None None
3 NaN NaN NaN NaN

Now create a cleaner which validates each column differently:

class BoolCleaner(tc.Cleaner):
    a = tc.Bool(delete=False)
    b = tc.Bool(true_values=[True], false_values=[False], allow_nan=False, delete=False)
    c = tc.Bool(true_values=[True], false_values=[False, None], allow_nan=False, delete=False)
    d = tc.Bool(true_values=[True], false_values=[False, np.nan], nan_values=[None], allow_nan=False, delete=False)

bool_cleaner = BoolCleaner(bool_df)
tc.MarkupFrame.from_validation(bool_df, bool_cleaner.verdicts)
a b c d
0 True True True True
1 False False False False
2 None None None None
3 nan nan nan nan

Note that I passed “delete=False” to each validator to keep rows with invalid data while still converting the available values; that way the dataframe keeps the shape that MarkupFrame.from_validation expects. “allow_nan” defaults to True and controls whether or not missing values are considered an error.

bool_cleaner.verdicts[~bool_cleaner.verdicts.valid]
column counter description reason valid
2 d 9 None cannot be converted to True or False. bool_nan_not_allowed False
2 b 10 None cannot be converted to True or False. bool_nan_not_allowed False
3 b 14 nan cannot be converted to True or False. bool_nan_not_allowed False
3 c 15 nan cannot be converted to True or False. bool_nan_not_allowed False

Tables coming from external sources, especially spreadsheets, are notorious for having all sorts of ways to indicate booleans or missing values. The Bool validator takes three arguments to handle these cases: true_values, false_values and nan_values.

messy_bools_column = ["T", "t", "on", "yes", "No", "F"]
messy_bools = pd.DataFrame(dict(a=messy_bools_column, b=messy_bools_column))

class BoolCleaner2(tc.Cleaner):
    a = tc.Bool()
    b = tc.Bool(true_values=["T"], false_values=["F"], allow_nan=False)

bool_cleaner2 = BoolCleaner2(messy_bools)
tc.MarkupFrame.from_validation(messy_bools, bool_cleaner2.verdicts)
a b
0 T T
1 t t
2 on on
3 yes yes
4 No No
5 F F

Email validation

Email validation is a subject unto itself. Some frameworks validate with simple regular expressions, which often isn’t enough. Other libraries or programs go so far as to ask the corresponding mail server whether it knows a particular address.
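To see why a naive regular expression isn’t enough, consider one of the messy addresses from above (the pattern here is deliberately simplistic):

import re
naive = re.compile(r".+@.+")
# The pattern happily accepts the embedded spaces and the malformed domain.
bool(naive.match("andy k@example .com"))
True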

In almost all generic use cases, you expect email addresses to adhere to a very specific form: a user name “at” a globally identifiable domain name, under the assumption that every computer in the world resolves this domain name to the same physical server. Email standards and most email servers, however, don’t require “fully qualified domain names” or even globally resolvable domains. “root@localhost” is a perfectly valid email address, but completely useless in most circumstances where you want to collect or use email addresses.

Table-Cleaner’s Email validator class is based on Django’s validation method.

messy_emails = ["alice@example.com", "bob@bob.com", "chris", "delta@localhost",
                "ernest@hemmingway@ernest.org", "fridolin@dev_server"]
email_df = pd.DataFrame(dict(email=messy_emails))

class EmailCleaner(tc.Cleaner):
    email = tc.Email()

email_cleaner = EmailCleaner(email_df)
tc.MarkupFrame.from_validation(email_df, email_cleaner.verdicts)
email
0 alice@example.com
1 bob@bob.com
2 chris
3 delta@localhost
4 ernest@hemmingway@ernest.org
5 fridolin@dev_server