Tuesday, June 28 • 5:21pm - 5:39pm
Data validation infrastructure: the validate package

Data validation consists of checking whether data agrees with prior knowledge or assumptions about the process that generated it, including the way it was collected. Such knowledge can often be expressed as a set of short statements, or rules, which the data must satisfy in order to be acceptable for further analysis.

Such rules may be of a technical nature or express domain knowledge. Examples of domain knowledge rules include 'someone who is unemployed cannot have an employer' (labour force survey), 'the total profit and cost of an organization must add up to the total revenue' (business survey), and 'the price of a product in this period must lie within 20% of last year's price' (consumer price index data).
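As a rough illustration, the three example rules above can each be written as a short predicate on a single record. This is a plain Python sketch, not the validate package's own syntax; the field names (status, employer, profit, cost, revenue, price, price_last_year) and the numerical tolerance are assumptions made for the example.

```python
# Each rule takes one record (a dict) and returns True when the record
# satisfies the rule. All field names are hypothetical.

def unemployed_has_no_employer(rec):
    # An unemployed respondent must not report an employer.
    return rec["status"] != "unemployed" or rec["employer"] is None

def profit_plus_cost_equals_revenue(rec, tol=1e-8):
    # Profit and cost must add up to revenue, up to a small tolerance
    # (the tolerance is an assumption to absorb rounding error).
    return abs(rec["profit"] + rec["cost"] - rec["revenue"]) <= tol

def price_within_20_percent(rec):
    # The current price may differ from last year's by at most 20%.
    return abs(rec["price"] - rec["price_last_year"]) <= 0.2 * rec["price_last_year"]

person = {"status": "unemployed", "employer": "Acme"}
firm = {"profit": 20.0, "cost": 80.0, "revenue": 100.0}
product = {"price": 130.0, "price_last_year": 100.0}

person_ok = unemployed_has_no_employer(person)    # violates the rule
firm_ok = profit_plus_cost_equals_revenue(firm)   # satisfies the rule
product_ok = price_within_20_percent(product)     # violates the rule
```

Writing each rule as a separate named function keeps the domain knowledge readable and lets the rules be checked one by one.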

Data validation is a frequently recurring step in a multi-step data cleaning process in which data quality is monitored throughout. For this reason, the validate package allows one to define data validation rules externally, confront them with data, and gather and visualize the results.
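The workflow of defining rules apart from the data, confronting them with records and gathering the results might be sketched as follows. This is a minimal Python illustration under stated assumptions: the confront function here is a hypothetical helper written for this example, and the rule names and fields are invented. It does not reproduce the package's actual interface.

```python
def confront(records, rules):
    """Evaluate every rule on every record and tally the outcomes per rule."""
    summary = {}
    for name, rule in rules.items():
        counts = {"passes": 0, "fails": 0, "missing": 0}
        for rec in records:
            try:
                ok = rule(rec)
            except (KeyError, TypeError):
                # A rule that cannot be evaluated (absent or None value)
                # is counted as missing rather than as a failure.
                counts["missing"] += 1
                continue
            counts["passes" if ok else "fails"] += 1
        summary[name] = counts
    return summary

# Rules are defined separately from the data they will be applied to.
rules = {
    "nonnegative_cost": lambda r: r["cost"] >= 0,
    "balance": lambda r: abs(r["profit"] + r["cost"] - r["revenue"]) < 1e-8,
}

records = [
    {"profit": 20.0, "cost": 80.0, "revenue": 100.0},
    {"profit": 10.0, "cost": -5.0, "revenue": 5.0},
    {"profit": 10.0, "cost": None, "revenue": 50.0},
]

result = confront(records, rules)
for name, counts in result.items():
    print(name, counts)
```

Rerunning the same confrontation after each cleaning step gives comparable pass and fail counts, which is one way to monitor data quality through the process.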

With the validate package, data validation rules become objects of computation that can be maintained, manipulated and investigated as separate entities. For example, it becomes possible to automatically detect contradictions in certain classes of rule sets. Maintenance is supported by import from and export to free-text or YAML files, which also allows rules to be endowed with metadata.
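To illustrate what treating rules as objects makes possible, the sketch below stores rules as data and checks one very simple class of rule sets for contradictions: lower and upper bounds on single variables. The BoundRule class, its fields and the find_contradictions helper are hypothetical names invented for this example; the abstract does not say which rule classes or which algorithm the package uses.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BoundRule:
    name: str        # metadata: a label for the rule
    variable: str    # the variable the rule constrains
    op: str          # either ">=" or "<="
    value: float     # the bound

def find_contradictions(rules):
    """Return the variables whose tightest lower bound exceeds their tightest upper bound."""
    lower, upper = {}, {}
    for r in rules:
        if r.op == ">=":
            lower[r.variable] = max(lower.get(r.variable, float("-inf")), r.value)
        elif r.op == "<=":
            upper[r.variable] = min(upper.get(r.variable, float("inf")), r.value)
        else:
            raise ValueError(f"unsupported operator: {r.op}")
    return sorted(v for v in lower if v in upper and lower[v] > upper[v])

rules = [
    BoundRule("adult", "age", ">=", 18),
    BoundRule("minor", "age", "<=", 15),   # cannot hold together with "adult"
    BoundRule("pos_income", "income", ">=", 0),
]

conflicts = find_contradictions(rules)
print(conflicts)
```

Because the rules are plain data rather than code scattered through an analysis script, they can be inspected like this before any data is seen.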

Gabriela de Queiroz

Data Scientist, Sharethrough


Mark van der Loo

Stats consultant, researcher, Statistics Netherlands