Wednesday, June 29 • 2:30pm - 3:30pm
Plotting for Marketers - Seeing the Story

Poster #22

The first step with a new data set is to check the data for completeness, accuracy, and reasonableness. This talk describes a method based on Jack Olson's Data Quality - The Accuracy Dimension. The input can be a raw text or spreadsheet file, or a source with columnar metadata such as a SQL table or an R data frame. The only setup is connecting to the data source. Using RMarkdown, dplyr, grid, and ggplot2, we produce a report in which each column is profiled by data type, summary statistics (if numeric or date), distribution plot, counts, and head and tail values. This supports a quick visual scan of each column for data quality issues. The simple visual format also helps in communicating with the data provider to dig into quality issues and, ideally, clean up the data set before time and effort are wasted on an analysis flawed by bad data. We provide examples of both good and suspect columns.
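
For readers who want a feel for the per-column profile described above, here is a minimal sketch in plain Python. It is not the author's RMarkdown/dplyr/ggplot2 report and omits the distribution plot. The function names (profile_column, profile_table), the sample rows, and the reading of "head and tail values" as the lowest and highest values after sorting are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, median


def profile_column(name, values, n_edge=3):
    """Profile one column: type, missing count, top counts, head/tail values."""
    # Treat None and empty strings as missing (an assumption of this sketch).
    present = [v for v in values if v is not None and v != ""]

    # A column counts as numeric only if every present value parses as a number.
    numeric = []
    for v in present:
        try:
            numeric.append(float(v))
        except (TypeError, ValueError):
            numeric = None
            break
    is_numeric = numeric is not None and len(numeric) > 0

    # Assumption: "head and tail" means the extremes of the sorted values.
    ordered = sorted(numeric) if is_numeric else sorted(map(str, present))

    profile = {
        "column": name,
        "type": "numeric" if is_numeric else "text",
        "n": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "top_counts": Counter(present).most_common(n_edge),
        "head": ordered[:n_edge],
        "tail": ordered[-n_edge:],
    }
    if is_numeric:
        profile["summary"] = {
            "min": ordered[0],
            "median": median(ordered),
            "mean": mean(ordered),
            "max": ordered[-1],
        }
    return profile


def profile_table(rows):
    """Profile every column of a table given as a list of dicts."""
    columns = list(rows[0].keys())
    return [profile_column(c, [r.get(c) for r in rows]) for c in columns]


# Hypothetical sample data with one missing age, one implausible age,
# and inconsistent capitalization in the state column.
rows = [
    {"customer_id": "1", "age": "34", "state": "CA"},
    {"customer_id": "2", "age": "", "state": "CA"},
    {"customer_id": "3", "age": "210", "state": "ca"},
    {"customer_id": "4", "age": "41", "state": "NY"},
]
profiles = profile_table(rows)
for p in profiles:
    print(p)
```

In this sample, the age profile exposes both a missing value and an implausible maximum, and the state profile shows three distinct values where two were expected. These are the kinds of suspect columns a quick scan is meant to surface.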

Jim Porzak

Principal, DS4CI.org
I am a (semi-)retired data scientist specializing in customer insights. I have been using R since 2002 and have presented at all but two useR! conferences, starting with the first, useR! 2004 in Vienna. See my archives at ds4ci.org/archives/ for past presentations, including tutorials at useR! Vienna and Dortmund.

Wednesday June 29, 2016 2:30pm - 3:30pm
Sponsor Pavilion 326 Galvez Street Stanford, CA 94305-6105

