Hello, my name is Devin Lange. I'm a PhD student at the University of Utah and currently a member of the Visualization Design Lab. In this video I'm going to be talking about Ferret, our recent work to help review tabular data sets for manipulation. In particular, this video is going to focus on a specific case study of the tool.

To begin, let me give a quick overview of the interface. On the left-hand side we have a number of analyses. Each one of these analyses is designed to help highlight patterns that indicate that data may have been manipulated. In the top middle portion of the screen we include a description that explains the visualizations, as well as what to look for and caveats to keep in mind. Since I'm going through the demo, I'm going to hide these descriptions so we have more space for our main view, which is the tabular visualization.

A quick overview of the data that we're looking at in this case study: it is associated with a paper that has been retracted, in particular experiment number three. The way this experiment is described in the paper is that the researchers worked with an insurance company and asked people to record the odometer reading of their car, that is, how many miles the car had driven. Then, after some period had elapsed (the paper didn't go into detail about exactly what period of time), the owners of the vehicles were asked to report their odometer reading again. Each row here is an actual insurance policy, so there can be up to four cars on it, but most of the policies only have a single car listed. For each car there is a previous column, the odometer at the beginning of this period, and an update column, the odometer reading after this period has elapsed.

So with that, let's talk about why some of these cells are highlighted in blue and some are white. If you hover over a cell, it explains all of the formatting, things like the font and font size, and these are what drive the formatting that you see on the screen. We see that there are 20,000 cells that have a font size of 1200 and a font of Cambria; however, most of the cells are actually in the Calibri font. So already we can see that in this first column there's a sort of strange mix of these two different fonts throughout the data.

We can switch to an overview mode that lets us view many columns at once, and if we scroll through the data set we can see that this pattern continues throughout the roughly 13,000 rows: there's a mix of these two fonts in this column, the odometer reading one update column is all Cambria, and the rest appear to be Calibri.
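To make this formatting analysis concrete, here is a minimal sketch, not Ferret's actual implementation, of how you might tally the font name and size of every populated cell in one spreadsheet column with openpyxl. The file name and column index are placeholders, not the actual case-study file.

```python
# Minimal sketch: tally the font of every populated cell in one column
# of an .xlsx file, similar in spirit to Ferret's formatting analysis.
# "data.xlsx" and column 2 (B) are placeholders.
from collections import Counter

from openpyxl import load_workbook

wb = load_workbook("data.xlsx")
ws = wb.active

font_counts = Counter()
for (cell,) in ws.iter_rows(min_col=2, max_col=2):
    if cell.value is not None:
        font_counts[(cell.font.name, cell.font.size)] += 1

for (name, size), count in font_counts.most_common():
    print(f"{name} {size}: {count} cells")
```

A column where one font dominates but a second font keeps reappearing, as in this demo, shows up here as two large entries in the tally.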
So let me quickly go back to the top of this list, and I'm now going to sort by the values within this column. Once we do this, new patterns emerge. Instead of the random mixing of the two fonts, the first thing we can notice is that every single value of zero within this region is in the Calibri font. So there's our first difference between the values in Calibri and Cambria.

Similarly, if we look at the next range of numbers, close to a thousand, this range between roughly 500 and 1,000 is not entirely in the blue font, but it is mostly in the blue font. If we continue down, we have more of a mixing of fonts, so to speak. But you may notice, if you look carefully, that there are some regions with larger consecutive chunks of white font. For instance, here and here are fairly large chunks of the white font, but we don't see the same thing with the blue font. If we look at the actual values in those regions, we see that they are rounded numbers: 37,000, 38,000, and 40,000. Interestingly, the rounder the number, the larger the chunk, so among these numbers 40,000 is the biggest. If you're looking at data that has been self-reported, like odometer readings reported by the owners of the vehicles, it's not unusual to see this sort of rounding effect: instead of looking at the actual reading, people estimate and go with the closest round number that seems reasonable. So that by itself is not unusual, but we do not see the same rounding effect in the blue font. So again, there's a difference in the data within a single column between these two fonts.

Okay, moving on to the other extreme of the data. If we move to the largest values within this column, we see a new pattern. I can click and drag to highlight many rows at once, and now we can see this interesting pattern where values alternate between the blue font and the white font, Calibri and Cambria. Furthermore, we see pairings of numbers: the largest two numbers in the data set, 982,000 and 983,000, are within one thousand miles of each other, and the next two closest also share this pattern. If you go one by one, you see this for every single pair here, where the two values are within 1,000 miles of each other. In fact, if you look at the second, third, and fourth cars on the policies that have more than one car, you see the same pattern: these two odometers are within a thousand miles, these two are within a thousand miles, and so on. So there's a bit of a strange pattern of matching values at this tail end of the distribution.
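That pairing at the top of the distribution can be described as a quick check. Below is a minimal sketch, assuming the table is available as a CSV and using placeholder file and column names, that sorts one odometer column and prints the gap between each of the largest values and the next one down; a run of alternating small (under 1,000 miles) and larger gaps would correspond to the pairing seen in the demo.

```python
# Minimal sketch: inspect gaps between consecutive values at the top of
# the distribution. "policies.csv" and the column name are placeholders.
import pandas as pd

df = pd.read_csv("policies.csv")

top = df["odometer_1_previous"].dropna().sort_values(ascending=False).head(20)
gaps = top.diff(-1).abs()  # distance from each value to the next-largest one

for value, gap in zip(top, gaps):
    print(f"{value:>10,.0f}  gap to next: {gap:>8,.0f}")
```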
Okay, I'm going to move on to our next analysis. I will skip the value distribution and go to the duplicate numbers, and I'm going to make this a little bit larger to make it easier to see. This analysis shows values that are duplicated within a single column. For odometer reading one previous, we can see that the most duplicated numbers are again these very round numbers: 60,000 was duplicated 30 times; in other words, there were 30 cars within this data set that reported an odometer of 60,000 miles. On the other hand, in the second column, the odometer reading after some time has passed, the update column, you do not see these round numbers, and the number of duplicates is much lower, three instead of 30, for instance.

For the second car, the values of zero actually correspond to not having a second car, but we can ignore those values (in any column, in fact), and with those ignored we see the same pattern for the second car: the previous column has round numbers with many duplicates, while the update column does not have round numbers and has hardly any duplicates. So again there seems to be some difference between the previous column and the update column. We see the same type of rounding effect in the replicate analysis and the duplicate digit analysis, but I'm going to skip to the trailing digits, which shows this perhaps most clearly. This chart looks at the relative frequency of the last digits of the numbers. Again, for the first column you can see that roughly 20 percent of the numbers end with a zero and the remaining digits are uniform, compared to the update column, where all of the digits are uniform. And again, we see the same pattern in the second, third, and fourth cars.
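Here is a minimal sketch of that last-digit check, again with placeholder file and column names rather than the actual headers in the case-study spreadsheet: it tabulates how often each trailing digit occurs in a previous column versus an update column. Heavy mass on zero in one column and a roughly uniform spread in the other would match what the tool shows here.

```python
# Minimal sketch: relative frequency of trailing digits per column.
# "policies.csv" and the column names are placeholders.
import pandas as pd

df = pd.read_csv("policies.csv")

for col in ["odometer_1_previous", "odometer_1_update"]:
    last_digit = df[col].dropna().astype(int) % 10
    freq = last_digit.value_counts(normalize=True).sort_index()
    print(col)
    print(freq.round(3))
```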
The last thing I'm going to switch to is the last analysis listed here, for checking your domain expectations. This provides a collection of different visualizations to help you analyze your data. The first thing I'm going to do is pull in odometer reading one, the previous and update columns we've been looking at, which plots a scatter plot. I'll reduce the opacity here, and let me reset this, actually.

Great, so now that we have this, I want to point out a couple of quick things. The bottom axis is the previous reading, and the y-axis is the update. To read this, a point right here means the car started with 200,000 miles and then wasn't driven at all; at the end of the time period it was still at 200,000 miles. In other words, everything on this line at the bottom of the boundary is a car that drove zero miles, or very close to zero miles. It's good that we don't see anything below this line, because those points would all indicate negative miles driven. On the other hand, if we look at this top line, we can look at, for instance, this point: this is a car that started with an odometer reading of zero, drove 50,000 miles, and at the end had 50,000 more miles on its odometer. So everything on this upper line is a car that drove 50,000 miles. Given the number of cars, roughly fifteen thousand total, and the density there, it is a bit strange to see such an abrupt cutoff, with so many cars driving close to 50,000 miles but no cars driving 51,000 miles or any greater value.

Okay, that concludes the abbreviated version of this case study. If you're interested in learning more about the tool, check out the paper. Thank you very much.
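As a rough companion to the scatter-plot check described above, the sketch below computes miles driven as update minus previous and counts readings that fall outside the expected range; the file and column names are placeholders, and the 50,000-mile ceiling is simply the cutoff noted in the demo, not a property of the tool.

```python
# Minimal sketch of the domain-expectation check: miles driven should never
# be negative, and the demo notes an oddly hard ceiling near 50,000 miles.
# "policies.csv" and the column names are placeholders.
import pandas as pd

df = pd.read_csv("policies.csv")

driven = df["odometer_1_update"] - df["odometer_1_previous"]
print("negative miles driven:", (driven < 0).sum())
print("more than 50,000 miles driven:", (driven > 50_000).sum())
print("within 1,000 of the 50,000 ceiling:", driven.between(49_000, 50_000).sum())
```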