Hello, my name is Devin Lange. I'm a PhD student at the University of Utah and currently a member of the Visualization Design Lab. In this video I'm going to be talking about Ferret, our recent work to help review tabular data sets for manipulation. In particular, this video is going to focus on a specific case study of the tool.

To begin, let me give a quick overview of the interface. On the left-hand side we have a number of analyses. Each one of these analyses is designed to help highlight patterns that indicate that data may have been manipulated. In the top middle portion of the screen we include a description that explains the visualizations, as well as what to look for and caveats to keep in mind. Since I'm going through the demo, I'm going to hide these descriptions so we have more space for our main view, which is the tabular visualization.

A quick overview of the data that we're looking at in this case study: it is associated with a paper that has been retracted, in particular experiment number three. The way this experiment is described in the paper is that the researchers worked with an insurance company and asked people to record the odometer reading of their car, that is, how many miles the car had driven. Then, after some period had elapsed (the paper didn't go into detail about exactly what period of time), the owners of the vehicles were asked to report their odometer reading again. Each row here is an actual insurance policy, so there can be up to four cars on it, but most of the policies only have a single car listed. For each car there is a previous column, the odometer at the beginning of this period, and an update column, the odometer reading after this period has elapsed.

So with that, let's talk about why some of these cells are highlighted in blue and some are white. If you hover over a cell, it explains all of the formatting, things like the font and font size, and these are what drive the formatting that you see on the screen. We see that there are 20,000 cells that have a font size of 1200 and a font of Cambria; however, most of the cells are actually in the Calibri font. So already we can see that in this first column there's a sort of strange mix of these two different fonts throughout the data.

We can switch to an overview mode that lets us view many columns at once, and if we scroll through the data set we can see that this pattern continues throughout the roughly 13,000 rows: there's a mix of these two fonts in this column, the odometer reading one update column is all Cambria, and the rest appear to be Calibri.
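To make this formatting analysis concrete, here is a minimal sketch, not Ferret's actual implementation, of how you might tally the font name and size of every populated cell in one spreadsheet column with openpyxl. The file name and column index are placeholders, not the actual case-study file.

```python
# Minimal sketch: tally the font of every populated cell in one column
# of an .xlsx file, similar in spirit to Ferret's formatting analysis.
# "data.xlsx" and column 2 (B) are placeholders.
from collections import Counter

from openpyxl import load_workbook

wb = load_workbook("data.xlsx")
ws = wb.active

font_counts = Counter()
for (cell,) in ws.iter_rows(min_col=2, max_col=2):
    if cell.value is not None:
        font_counts[(cell.font.name, cell.font.size)] += 1

for (name, size), count in font_counts.most_common():
    print(f"{name} {size}: {count} cells")
```

A column where one font dominates but a second font keeps reappearing, as in this demo, shows up here as two large entries in the tally.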
So let me quickly go back to the top of this list, and I'm now going to sort by the values within this column. Once we do this, new patterns emerge. Instead of the random mixing of the two fonts, the first thing we can notice is that every single value of zero within this region is in the Calibri font. So there's our first difference between the values in Calibri and Cambria.

Similarly, if we look at the next range of numbers, close to a thousand, this range between roughly 500 and 1,000 is not entirely in the blue font, but it is mostly in the blue font. If we continue down, we have more of a mixing of fonts, so to speak. But you may notice, if you look carefully, that there are some regions with larger consecutive chunks of white font. For instance, here and here are fairly large chunks of the white font, but we don't see the same thing with the blue font. If we look at the actual values in those regions, we see that they are rounded numbers: 37,000, 38,000, and 40,000. Interestingly, the rounder the number, the larger the chunk, so among these numbers 40,000 is the biggest. If you're looking at data that has been self-reported, like odometer readings reported by the owners of the vehicles, it's not unusual to see this sort of rounding effect: instead of looking at the actual reading, people estimate and go with the closest round number that seems reasonable. So that by itself is not unusual, but we do not see the same rounding effect in the blue font. So again, there's a difference in the data within a single column between these two fonts.

Okay, moving on to the other extreme of the data. If we move to the largest values within this column, we see a new pattern. I can click and drag to highlight many rows at once, and now we can see this interesting pattern where values alternate between the blue font and the white font, Calibri and Cambria. Furthermore, we see pairings of numbers: the largest two numbers in the data set, 982,000 and 983,000, are within one thousand miles of each other, and the next two closest also share this pattern. If you go one by one, you see this for every single pair here, where the two values are within 1,000 miles of each other. In fact, if you look at the second, third, and fourth cars on the policies that have more than one car, you see the same pattern: these two odometers are within a thousand miles, these two are within a thousand miles, and so on. So there's a bit of a strange pattern of matching values at this tail end of the distribution.
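That pairing at the top of the distribution can be described as a quick check. Below is a minimal sketch, assuming the table is available as a CSV and using placeholder file and column names, that sorts one odometer column and prints the gap between each of the largest values and the next one down; a run of alternating small (under 1,000 miles) and larger gaps would correspond to the pairing seen in the demo.

```python
# Minimal sketch: inspect gaps between consecutive values at the top of
# the distribution. "policies.csv" and the column name are placeholders.
import pandas as pd

df = pd.read_csv("policies.csv")

top = df["odometer_1_previous"].dropna().sort_values(ascending=False).head(20)
gaps = top.diff(-1).abs()  # distance from each value to the next-largest one

for value, gap in zip(top, gaps):
    print(f"{value:>10,.0f}  gap to next: {gap:>8,.0f}")
```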
Okay, I'm going to move on to our next analysis. I will skip the value distribution and go to the duplicate numbers, and I'm going to make this a little bit larger to make it easier to see. This analysis shows values that are duplicated within a single column. For odometer reading one previous, we can see that the most duplicated numbers are again these very round numbers: 60,000 was duplicated 30 times; in other words, there were 30 cars within this data set that reported an odometer of 60,000 miles. On the other hand, in the second column, the odometer reading after some time has passed, the update column, you do not see these round numbers, and the number of duplicates is much lower, three instead of 30, for instance.

For the second car, the values of zero actually correspond to not having a second car, but we can ignore those values (in any column, in fact), and with those ignored we see the same pattern for the second car: the previous column has round numbers with many duplicates, while the update column does not have round numbers and has hardly any duplicates. So again there seems to be some difference between the previous column and the update column. We see the same type of rounding effect in the replicate analysis and the duplicate digit analysis, but I'm going to skip to the trailing digits, which shows this perhaps most clearly. This chart looks at the relative frequency of the last digits of the numbers. Again, for the first column you can see that roughly 20 percent of the numbers end with a zero and the remaining digits are uniform, compared to the update column, where all of the digits are uniform. And again, we see the same pattern in the second, third, and fourth cars.
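Here is a minimal sketch of that last-digit check, again with placeholder file and column names rather than the actual headers in the case-study spreadsheet: it tabulates how often each trailing digit occurs in a previous column versus an update column. Heavy mass on zero in one column and a roughly uniform spread in the other would match what the tool shows here.

```python
# Minimal sketch: relative frequency of trailing digits per column.
# "policies.csv" and the column names are placeholders.
import pandas as pd

df = pd.read_csv("policies.csv")

for col in ["odometer_1_previous", "odometer_1_update"]:
    last_digit = df[col].dropna().astype(int) % 10
    freq = last_digit.value_counts(normalize=True).sort_index()
    print(col)
    print(freq.round(3))
```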
The last thing I'm going to switch to is the last analysis listed here, for checking your domain expectations. This provides a collection of different visualizations to help you analyze your data. The first thing I'm going to do is pull in odometer reading one, the previous and update columns we've been looking at, which plots a scatter plot. I'll reduce the opacity here, and let me reset this, actually.

Great, so now that we have this, I want to point out a couple of quick things. The bottom axis is the previous reading, and the y-axis is the update. To read this, a point right here means the car started with 200,000 miles and then wasn't driven at all; at the end of the time period it was still at 200,000 miles. In other words, everything on this line at the bottom of the boundary is a car that drove zero miles, or very close to zero miles. It's good that we don't see anything below this line, because those points would all indicate negative miles driven. On the other hand, if we look at this top line, we can look at, for instance, this point: this is a car that started with an odometer reading of zero, drove 50,000 miles, and at the end had 50,000 more miles on its odometer. So everything on this upper line is a car that drove 50,000 miles. Given the number of cars, roughly fifteen thousand total, and the density there, it is a bit strange to see such an abrupt cutoff, with so many cars driving close to 50,000 miles but no cars driving 51,000 miles or any greater value.

Okay, that concludes the abbreviated version of this case study. If you're interested in learning more about the tool, check out the paper. Thank you very much.
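As a rough companion to the scatter-plot check described above, the sketch below computes miles driven as update minus previous and counts readings that fall outside the expected range; the file and column names are placeholders, and the 50,000-mile ceiling is simply the cutoff noted in the demo, not a property of the tool.

```python
# Minimal sketch of the domain-expectation check: miles driven should never
# be negative, and the demo notes an oddly hard ceiling near 50,000 miles.
# "policies.csv" and the column names are placeholders.
import pandas as pd

df = pd.read_csv("policies.csv")

driven = df["odometer_1_update"] - df["odometer_1_previous"]
print("negative miles driven:", (driven < 0).sum())
print("more than 50,000 miles driven:", (driven > 50_000).sum())
print("within 1,000 of the 50,000 ceiling:", driven.between(49_000, 50_000).sum())
```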