Data validation consists of checking whether data agrees with prior knowledge or assumptions about the process that generated the data, including how it was collected. Such knowledge can often be expressed as a set of short statements, or rules, which the data must satisfy in order to be acceptable for further analyses.
Such rules may be of a technical nature or express domain knowledge. Examples of domain knowledge rules include 'someone who is unemployed cannot have an employer' (labour force survey), 'the total profit and cost of an organization must add up to the total revenue' (business survey) and 'the price of a product in this period must lie within 20% of last year's price' (consumer price index data).
Data validation is a frequently recurring step in a multi-step data cleaning process in which the progress of data quality is monitored throughout. For this reason, the validate package allows one to define data validation rules externally, confront them with data, and gather and visualize the results.
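As a minimal sketch of this workflow, the following defines two rules, confronts them with a small data set and summarizes the outcome. The data frame and its column names (profit, cost, revenue) are hypothetical and chosen only to mirror the business survey example; the sketch assumes the validate package is installed.

```r
library(validate)

# Hypothetical business survey records; the third record violates the balance rule
dat <- data.frame(
  profit  = c(10, 5, 20),
  cost    = c(90, 45, 70),
  revenue = c(100, 50, 100)
)

# Define the rules separately from the data
rules <- validator(
  balance = profit + cost == revenue,
  nonneg  = revenue >= 0
)

# Confront the data with the rules and gather the results
out <- confront(dat, rules)

# One row per rule, with counts of passes, fails and missing results
s <- summary(out)
print(s)
```

Because the rules live in their own object, the same rule set can be confronted with the data again after each cleaning step to follow how the number of violations changes.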
With the validate package, data validation rules become objects of computation that can be maintained, manipulated and investigated as separate entities. For example, it becomes possible to automatically detect contradictions in certain classes of rule sets. Maintenance is supported by import from and export to free-text or YAML files, which allows rules to be annotated with metadata.
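The following sketch illustrates how a rule set might be inspected, annotated and round-tripped through a YAML file. The rule names, labels and descriptions are hypothetical, and the sketch assumes the validate package is installed; it does not cover contradiction detection.

```r
library(validate)

rules <- validator(
  balance = profit + cost == revenue,
  nonneg  = revenue >= 0
)

# A rule set can be inspected and subsetted like other R objects
n_rules   <- length(rules)     # number of rules
vars      <- variables(rules)  # variables referenced by the rules
only_bal  <- rules["balance"]  # a rule set holding a single rule

# Attach metadata (hypothetical labels and descriptions)
label(rules) <- c("Balance check", "Non-negative revenue")
description(rules) <- c(
  "Profit and cost must add up to revenue.",
  "Revenue cannot be negative."
)

# Export the rules with their metadata to YAML, then read them back
path <- tempfile(fileext = ".yaml")
export_yaml(rules, file = path)
restored <- validator(.file = path)
```

Keeping rules in a file of this kind means they can be versioned and reviewed independently of the scripts that apply them.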