Hello. I'm Devin Lange. I'm a PhD student at the University of Utah and a member of the Visualization Design Lab. I will be presenting work done by myself, Shaurya Sahai, Jeff Phillips, and Alexander Lex. In this presentation, I will discuss academic misconduct, specifically the fabrication and falsification of tabular datasets. In addition, I will discuss our ideas and contributions towards battling this problem.

We believe that most scientists work hard to expand knowledge and improve society. Unfortunately, some cheat, which can lead to devastating consequences. One notable example of this is key Alzheimer's research that is now in question. This research led to 15 years of misguided focus in the scientific community, as well as the development of pharmaceutical drugs. All of this work now has a massive question mark above it because the foundational research contains data that appears to be manipulated. And this is just one example of many troubling cases in just the last few years.

So what can we do? First, we can look at what is already being done for similar problems. Plagiarism in writing is something that editors of journals check for with purpose-built software such as iThenticate. Tools also exist for detecting duplicated regions in image data, often a sign that the data has been manipulated. But what about tabular datasets?

Let's examine another story. In this particular story, we look at a junior research faculty member, Dr. Kate Laskowski. Early in her career, she started a new collaboration with a prominent member of her field, who provided the data for a paper they published together about the behavior of spiders. Later, after a closer investigation, she determined that the tabular datasets he provided could not be trusted. In this incredibly difficult situation, she decided to retract the paper and publish a blog post chronicling the story of how she came to her conclusions about this data.

I would like to propose three high-level steps we can take in the scientific community to help prevent the manipulation of tabular datasets.
First, it's important to understand the properties of manipulated data. Next, I think there's room to create more tools to help identify manipulated data. And finally, we must consider how best to implement best practices, both individually and as a community, to reduce data manipulation.

To begin, understanding fraud can be challenging due to the adversarial nature of this topic. The way that we approached this in our work was by examining tabular datasets associated with retracted papers or papers that had an expression of concern issued. We also reviewed the public arguments for why these papers should be retracted. Ultimately, we found ten datasets from the fields of biology, medicine, psychology, and marketing. We have labeled each with a unique key. After investigating all of these datasets, we identified patterns that exist across them. One pattern that we identified is unexpected formatting, which I will discuss in detail later. We also recorded the four datasets that exhibit this pattern. We did this for seven other patterns and organized them into four higher-level themes: formatting, numerical, structural, and domain. We refer to these patterns as artifacts of manipulation.

You may notice that we have included unexpected leading digits as an artifact, even though we did not identify it in any of our datasets. This is because Benford's law, which describes the expected frequency of leading digits, is a well-established technique for identifying fraud, especially in financial data. However, we found that the criteria required to run this test were frequently not met.

Our goal was to organize these artifacts so that they are general enough to cover a wide range of scenarios while still being useful, and we believe we have successfully struck that balance. However, we acknowledge that there are artifacts that we may not have captured, so we have created a living document online that lists what we have found and invites others to suggest changes or additions.

Moving on to the second point: right now, there is a gap in tools that aid in data forensics for tabular data.
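(As a rough illustration of the leading-digit check mentioned above, here is a minimal Python sketch, not part of the tool described in this talk, that compares observed first-digit frequencies against the Benford distribution. The function names and example values are ours.)

```python
from collections import Counter
import math

def leading_digit_frequencies(values):
    """Return the observed frequency of each first significant digit (1-9)."""
    digits = []
    for v in values:
        v = abs(v)
        if v == 0:
            continue  # zero has no leading significant digit
        # Shift the value into [1, 10) and take the integer part.
        digits.append(int(v / (10 ** math.floor(math.log10(v)))))
    counts = Counter(digits)
    n = len(digits)
    return {d: counts.get(d, 0) / n for d in range(1, 10)} if n else {}

def benford_expected():
    """Benford's law: P(d) = log10(1 + 1/d) for d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Toy comparison of observed vs. expected leading-digit frequencies.
observed = leading_digit_frequencies([1200, 130, 142, 180, 950, 23, 31, 3400])
for d in range(1, 10):
    print(d, round(observed.get(d, 0), 3), round(benford_expected()[d], 3))
```

(Note that, as the talk points out, this comparison is only meaningful when the data satisfy the law's preconditions, such as spanning several orders of magnitude.)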
For plagiarism, there are purpose-built tools such as iThenticate, which are widely used. But to our knowledge, no such tool exists for tabular data. There are various statistical tests that can check the plausibility of data, such as Benford's law, or determining the probability that a dataset contains a duplicate number. However, these require careful setup and cannot always be used. Furthermore, we worry that applying these tests when it isn't appropriate would result in false positives viewed with high confidence by the user, because they come with the authority of rigorous math. As a result, our tool highlights these artifacts of manipulation, but also embeds advice on what can cause the artifact, including benign causes.

Introducing Ferret. On the left-hand side, a list of analyses is available. Each analysis is designed to highlight a different artifact. The analysis explanation is the embedded advice: it introduces the pattern, explains what to look for, and gives warnings on how it could be misinterpreted. The summary charts provide aggregate information about the column of data below. Even though the visual encodings are simple, the data transformations performed are carefully crafted to surface artifacts of manipulation. Finally, the tabular visualization gives access to the raw data.

So we've talked about these two ideas of understanding fraud and creating tools. They are clearly linked: understanding fraud directly guided the design choices of Ferret. In other words, once you know the patterns, you can create tools to highlight them.

Now, I will spend some time going into the details of some of the artifacts of manipulation. For details on the others, please refer to the paper or the living document of artifacts.

To begin, I will discuss unexpected formatting, with an example from the driving dataset. Formatting here does not refer to the values of the data, but rather to things that change the appearance of the data in formatting tools like Microsoft Excel. This includes things like the font, font size, and methods of text emphasis such as bold, italic, or underline. Variation in formatting is not automatically suspicious.
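(As a small sketch of how per-cell formatting can be surfaced programmatically, the snippet below counts the fonts used in one spreadsheet column with openpyxl. This is our own illustration, not the tool's implementation; the file name and column are hypothetical placeholders.)

```python
from collections import Counter
from openpyxl import load_workbook

wb = load_workbook("driving_data.xlsx")  # hypothetical file name
ws = wb.active

fonts = Counter()
# Scan the first column, skipping the header row, and tally each cell's font name.
for (cell,) in ws.iter_rows(min_col=1, max_col=1, min_row=2):
    if cell.value is not None:
        fonts[cell.font.name] += 1

print(fonts)  # e.g. two fonts splitting one column would be worth a closer look
```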
For instance, it is common to emphasize column headers. However, there can be more subtle variations. Unless you have an incredibly keen eye, you probably didn't notice that I've formatted these numbers in two different fonts. While not immediately incriminating, I would consider this to be suspicious, or at the very least strange and worth investigating further. This exact pattern exists in one of the retracted datasets: this column represents the odometer reading for 13,488 cars. These formatting differences can be visually subtle, so we do not recreate them in the tool. Instead, we choose to highlight when there are differences, in this case blue for the Cambria font and white for Calibri. Switching to overview mode lets more rows fit on the screen, and sorting the data reveals some interesting patterns. All of the values of zero in this data are in the white font. Most of the values between 300 and 1,000 are blue. Scrolling through the data, we can see regions of white at round numbers, but we don't see the same effect for the blue values. Lastly, skipping to the largest values in the dataset: if we select all of these values and expand them, we can see that the fonts alternate perfectly between blue and white.

Moving on, I will discuss one type of numerical artifact. Numerical artifacts relate to the actual values recorded. Duplicate numbers and digits refer to when whole numbers or sequences of digits are repeated more frequently than expected. This can suggest that data may have been copy-pasted or manually entered. Evaluating whether the number of duplicates is more than expected requires understanding the precision of the data and the number of samples. Finding a few duplicates in a large dataset is not suspicious, but finding many duplicates in a small dataset may warrant a closer look.

This dataset represents the amount of time in seconds it takes for a spider to reemerge from its enclosure, a proxy for that spider's boldness. 600 seconds, or 10 minutes, is duplicated many times in this dataset.
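(A minimal sketch, in our own words rather than the tool's code, of listing how often each value repeats in a column so that unusually frequent values stand out. The variable names and toy data are illustrative only.)

```python
from collections import Counter

# Toy stand-in for a column of emergence times, in seconds.
emergence_seconds = [600, 600, 104, 104, 104, 317, 600, 85, 212, 212, 443]

counts = Counter(emergence_seconds)
# Drop a known benign cause of duplication: 600 s is the cap of the observation period.
counts.pop(600, None)

for value, n in counts.most_common():
    if n > 1:
        print(f"{value} appears {n} times")
```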
However, the observation period was capped at 10 minutes, so it's reasonable to ignore that value in our analysis. The issue in this dataset is that there are many values that are duplicated two or three times. Matching values will be highlighted and brought to the top of our list. Here the number 104 shows up three times, which we may consider moderately suspicious given the precision and the size of this dataset.

Structural artifacts relate to the values of spreadsheet cells as well as their positions within the spreadsheet. A region refers to multiple cells with a spatial relationship. This could mean cells that are adjacent in columns or rows, or even nearby cells with gaps in between them. In some scenarios, a few duplicate numbers may be a weak signal of manipulation, or not suspicious at all. But repeated regions are a much stronger signal that something suspicious is happening. The last example that we looked at actually includes an example of this: here, a complete row with the numbers 85, 180, 104, 228.1, and 151.34 seems to have been copied.

Finally, domain artifacts apply knowledge beyond the single dataset. In a given domain, there may be prior knowledge about what is an expected distribution for single-dimensional data, or expected relationships between multidimensional data. We consider deviations from these expectations to be artifacts.

Let's consider a hypothetical dataset that measures the height and weight of people. It is possible that more sophisticated techniques for generating fabricated data, such as scripts, may not leave behind the earlier artifacts we've discussed. Furthermore, the distribution of each column may also match domain expectations. However, if columns are generated independently, they may exhibit strange relationships, such as no correlation when there should be one, or even one that is opposite from expectations, as seen here, where taller people tend to weigh less. In this case, it isn't difficult to think of ways to generate two columns with appropriate relationships, but this gets more difficult as you introduce more columns.
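(A minimal sketch of the kind of domain-knowledge check just described: comparing the observed correlation between two columns against the direction the domain leads us to expect. The expectation and the toy data below are our own illustration.)

```python
import numpy as np

height_cm = np.array([155, 160, 165, 170, 175, 180, 185, 190])
weight_kg = np.array([72, 70, 69, 66, 64, 62, 60, 58])  # fabricated toy data

r = np.corrcoef(height_cm, weight_kg)[0, 1]
expected_sign = +1  # in most populations, taller people tend to weigh more

if np.sign(r) != expected_sign:
    print(f"Correlation r = {r:.2f} has the opposite sign from what the domain suggests")
```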
Every additional column introduces a new set of relationships that can expose a potential flaw in the data fabrication process. Looking again at the driving dataset, we can look at the interaction of two columns: the odometer readings of cars at one point in time, and the odometer readings of the same cars at a later point in time. Each point corresponds to a single car. A point here would be a car that started with zero miles on its odometer and ended with approximately 25,000 miles, and this would be a car that started at roughly 100,000 miles and ended at 125,000 miles. The points on this line correspond to cars that drove zero miles in this time period. Similarly, the cars on this line drove exactly 50,000 miles. We can notice in this plot that there are zero cars that drove more than 50,000 miles, despite many driving near that boundary. Given that there are many cars with between 45,000 and 50,000 miles driven, we would also expect there to be at least a few with more than 50,000 miles.

That brings us to our last main point. Preventing the manipulation of data will require more than just tools and knowledge; it will require changes to the policies and best practices of the scientific community. One idea we think is worth exploring is for journal editors to review datasets for manipulation, similarly to how they currently check for plagiarism. Individuals can also perform similar checks when they receive data from their collaborators.

This is a difficult problem, and we don't claim to have all the answers. There are potential unintended consequences of this work. For instance, false positives could still lead to misguided accusations. Because of this, we believe tools like ours should be used in the review process, where authors can respond to concerns about their data, and the result is a rejected paper, not an accusation that can threaten an author's career. There is also the possibility that bad actors could use tools like ours to improve their falsification of data. And while some may do this, as we have seen with existing plagiarism tools, such tools still continue to catch many instances of cheating.
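(Referring back to the odometer example above: a quick sketch, assuming synthetic stand-in data rather than the real dataset, of checking whether the difference between two paired readings ever exceeds a suspected cutoff, and how much of the data sits just below it.)

```python
import numpy as np

rng = np.random.default_rng(0)
start = rng.uniform(0, 100_000, size=1000)       # illustrative data only
end = start + rng.uniform(0, 50_000, size=1000)  # differences happen to be capped at 50,000 here

miles_driven = end - start
cutoff = 50_000

over = np.sum(miles_driven > cutoff)
near = np.sum((miles_driven > 0.9 * cutoff) & (miles_driven <= cutoff))
print(f"{over} cars above {cutoff} miles driven, {near} cars within 10% below it")
# Many cars just below a hard cutoff and none at all above it is the pattern worth questioning.
```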
Still, it is important to consider these types of broader impacts. We discuss these two concerns, as well as others, in more detail in our paper. At the end of the day, even after spending so much time looking at these cases of data manipulation, we still believe in the scientific community. But we can and should do more to prevent the manipulation of tabular datasets. We have taken a first step, and we urge you to consider how you can help the community understand fraud, build tools, and implement best practices.