Wednesday, June 29 • 2:30pm - 3:30pm
Data Quality Profiling - The First Step with New Data


Poster #23

The first step when getting a new data set is to check the data for completeness, accuracy, and reasonableness. This talk will describe a method based on Jack Olson's Data Quality - The Accuracy Dimension. The input data set can be either a raw text or spreadsheet file, or it can come from a source with columnar metadata, such as a SQL table or an R data frame. The only setup is to connect to the data source. Using RMarkdown, dplyr, grid, and ggplot2, we produce a report in which each column is profiled by data type, summary statistics (if numeric or date), distribution plot, counts, and head and tail values. This facilitates a quick visual scan of each column for data quality issues. The simple visual format also aids communication with the data provider when digging into quality issues and, hopefully, helps clean up the data set before time and effort are wasted on an analysis flawed by bad data. We provide examples of both good and suspect columns.
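
The per-column profile described above can be sketched in a few lines. The sketch below is not the speaker's code: the talk uses R (RMarkdown, dplyr, grid, ggplot2), while this illustration uses only the Python standard library and leaves out the distribution plot and date handling. The function names (`is_number`, `profile_column`, `profile_table`) and the sample rows are hypothetical, and the input is assumed to be a list of string-valued dictionaries such as `csv.DictReader` produces.

```python
import math
from collections import Counter
from statistics import mean, median


def is_number(value):
    """Return True if the string parses as a finite number."""
    try:
        return math.isfinite(float(value))
    except ValueError:
        return False


def profile_column(name, raw_values, n_show=3):
    """Profile one column: type, missing and distinct counts, head and tail."""
    # Treat None and blank strings as missing (an assumption of this sketch).
    present = [v.strip() for v in raw_values if v is not None and v.strip() != ""]
    if not present:
        col_type = "empty"
    elif all(is_number(v) for v in present):
        col_type = "numeric"
    else:
        col_type = "text"

    profile = {
        "column": name,
        "type": col_type,
        "n_rows": len(raw_values),
        "n_missing": len(raw_values) - len(present),
        "n_distinct": len(set(present)),
        "top_counts": Counter(present).most_common(n_show),
    }
    if col_type == "numeric":
        nums = [float(v) for v in present]
        profile["summary"] = {
            "min": min(nums),
            "median": median(nums),
            "mean": mean(nums),
            "max": max(nums),
        }
        distinct = sorted(set(nums))
    else:
        distinct = sorted(set(present))
    # Head and tail of the sorted distinct values expose odd extremes.
    profile["head"] = distinct[:n_show]
    profile["tail"] = distinct[-n_show:]
    return profile


def profile_table(rows):
    """Profile every column of a list of dict rows."""
    if not rows:
        return []
    return [profile_column(col, [row.get(col) for row in rows]) for col in rows[0]]


# Hypothetical sample data with a few deliberately suspect values.
sample_rows = [
    {"customer_id": "1", "age": "34", "state": "CA"},
    {"customer_id": "2", "age": "", "state": "CA"},
    {"customer_id": "3", "age": "29", "state": "ca"},
    {"customer_id": "4", "age": "999", "state": "NY"},
]

for column_profile in profile_table(sample_rows):
    print(column_profile)
```

In the sample rows, the `age` profile shows one missing value and an implausible maximum, and the `state` profile shows three distinct values where two would be expected because of inconsistent capitalization. These are the kinds of issues a quick per-column scan is meant to surface.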

Speakers

Jim Porzak

Principal, DS4CI.org
I am a (semi-)retired data scientist specializing in customer insights. I have been using R since 2002 and have presented at all but two useR! conferences, starting with the first, useR! 2004 in Vienna. See my archives at ds4ci.org/archives/ for past presentations, including tutorials at useR! Vienna and Dortmund.


Wednesday June 29, 2016 2:30pm - 3:30pm
Sponsor Pavilion 326 Galvez Street Stanford, CA 94305-6105
