Hello. I'm Devin Lange. I'm a PhD student at the University of Utah and a member of the Visualization Design Lab. I will be presenting work done by myself, Shaurya Sahai, Jeff Phillips, and Alexander Lex. In this presentation, I will discuss academic misconduct, specifically the fabrication and falsification of tabular datasets. In addition, I will discuss our ideas and contributions towards battling this problem.

We believe that most scientists work hard to expand knowledge and improve society. Unfortunately, some cheat, which can lead to devastating consequences. One notable example of this is key Alzheimer's research that is now in question. This research led to 15 years of misguided focus in the scientific community, as well as the development of pharmaceutical drugs. All of this work now has a massive question mark above it because the foundational research contains data that appears to be manipulated. And this is just one example of many troubling cases in just the last few years.

So what can we do? First, we can look at what is already being done for similar problems. Plagiarism in writing is something that editors of journals check for with purpose-built software such as iThenticate. Tools also exist for detecting duplicated regions in image data, often a sign that the data has been manipulated. But what about tabular datasets?

Let's examine another story. In this particular story, we look at a junior research faculty member, Dr. Kate Laskowski. Early in her career, she started a new collaboration with a prominent member of her field, who provided the data for a paper they published together about the behavior of spiders. Later, after a closer investigation, she determined that the tabular datasets he provided could not be trusted. In this incredibly difficult situation, she decided to retract the paper and publish a blog post chronicling the story of how she came to her conclusions about this data.

I would like to propose three high-level steps we can take in the scientific community to help prevent the manipulation of tabular datasets.
First, it's important to understand the properties of manipulated data. Next, I think there's room to create more tools to help identify manipulated data. And finally, we must consider how best to implement best practices, both individually and as a community, to reduce data manipulation.

To begin, understanding fraud can be challenging due to the adversarial nature of this topic. The way that we approached this in our work was by examining tabular datasets associated with retracted papers or papers that had an expression of concern issued. We also reviewed the public arguments for why these papers should be retracted. Ultimately, we found ten datasets from the fields of biology, medicine, psychology, and marketing. We have labeled each with a unique key. After investigating all of these datasets, we identified patterns that exist across them. One pattern that we identified is unexpected formatting, which I will discuss in detail later. We also recorded the four datasets that exhibit this pattern. We did this for seven other patterns and organized them into four higher-level themes: formatting, numerical, structural, and domain. We refer to these patterns as artifacts of manipulation.

You may notice that we have included unexpected leading digits as an artifact, even though we did not identify it in any of our datasets. This is because Benford's law, which describes the expected frequency of leading digits, is a well-established technique for identifying fraud, especially in financial data. However, we found that the criteria required to run this test were frequently not met.

Our goal was to organize these artifacts so that they are general enough to cover a wide range of scenarios while still being useful, and we believe we have successfully struck that balance. However, we acknowledge that there are artifacts that we may not have captured, so we have created a living document online that lists what we have found and invites others to suggest changes or additions.

Moving on to the second point: right now, there is a gap in tools that aid in data forensics for tabular data.
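(As a rough illustration of the leading-digit check mentioned above, here is a minimal Python sketch, not part of the tool described in this talk, that compares observed first-digit frequencies against the Benford distribution. The function names and example values are ours.)

```python
from collections import Counter
import math

def leading_digit_frequencies(values):
    """Return the observed frequency of each first significant digit (1-9)."""
    digits = []
    for v in values:
        v = abs(v)
        if v == 0:
            continue  # zero has no leading significant digit
        # Shift the value into [1, 10) and take the integer part.
        digits.append(int(v / (10 ** math.floor(math.log10(v)))))
    counts = Counter(digits)
    n = len(digits)
    return {d: counts.get(d, 0) / n for d in range(1, 10)} if n else {}

def benford_expected():
    """Benford's law: P(d) = log10(1 + 1/d) for d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Toy comparison of observed vs. expected leading-digit frequencies.
observed = leading_digit_frequencies([1200, 130, 142, 180, 950, 23, 31, 3400])
for d in range(1, 10):
    print(d, round(observed.get(d, 0), 3), round(benford_expected()[d], 3))
```

(Note that, as the talk points out, this comparison is only meaningful when the data satisfy the law's preconditions, such as spanning several orders of magnitude.)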
For plagiarism, there are purpose-built tools such as iThenticate, which are widely used. But to our knowledge, no such tool exists for tabular data. There are various statistical tests that can check the plausibility of data, such as Benford's law, or determining the probability that a dataset contains a duplicate number. However, these require careful setup and cannot always be used. Furthermore, we worry that applying these tests when it isn't appropriate would result in false positives viewed with high confidence by the user, because they come with the authority of rigorous math. As a result, our tool highlights these artifacts of manipulation, but also embeds advice on what can cause the artifact, including benign causes.

Introducing Ferret. On the left-hand side, a list of analyses is available. Each analysis is designed to highlight a different artifact. The analysis explanation is the embedded advice: it introduces the pattern, explains what to look for, and gives warnings on how it could be misinterpreted. The summary charts provide aggregate information about the column of data below. Even though the visual encodings are simple, the data transformations performed are carefully crafted to surface artifacts of manipulation. Finally, the tabular visualization gives access to the raw data.

So we've talked about these two ideas of understanding fraud and creating tools. They are clearly linked: understanding fraud directly guided the design choices of Ferret. In other words, once you know the patterns, you can create tools to highlight them.

Now, I will spend some time going into the details of some of the artifacts of manipulation. For details on the others, please refer to the paper or the living document of artifacts.

To begin, I will discuss unexpected formatting, with an example from the driving dataset. Formatting here does not refer to the values of the data, but rather to things that change the appearance of the data in formatting tools like Microsoft Excel. This includes things like the font, font size, and methods of text emphasis such as bold, italic, or underline. Variation in formatting is not automatically suspicious.
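(As a small sketch of how per-cell formatting can be surfaced programmatically, the snippet below counts the fonts used in one spreadsheet column with openpyxl. This is our own illustration, not the tool's implementation; the file name and column are hypothetical placeholders.)

```python
from collections import Counter
from openpyxl import load_workbook

wb = load_workbook("driving_data.xlsx")  # hypothetical file name
ws = wb.active

fonts = Counter()
# Scan the first column, skipping the header row, and tally each cell's font name.
for (cell,) in ws.iter_rows(min_col=1, max_col=1, min_row=2):
    if cell.value is not None:
        fonts[cell.font.name] += 1

print(fonts)  # e.g. two fonts splitting one column would be worth a closer look
```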
For instance, it is common to emphasize column headers. However, there can be more subtle variations. Unless you have an incredibly keen eye, you probably didn't notice that I've formatted these numbers in two different fonts. While not immediately incriminating, I would consider this to be suspicious, or at the very least strange and worth investigating further. This exact pattern exists in one of the retracted datasets: this column represents the odometer reading for 13,488 cars. These formatting differences can be visually subtle, so we do not recreate them in the tool. Instead, we choose to highlight when there are differences, in this case blue for the Cambria font and white for Calibri. Switching to overview mode lets more rows fit on the screen, and sorting the data reveals some interesting patterns. All of the values of zero in this data are in the white font. Most of the values between 300 and 1,000 are blue. Scrolling through the data, we can see regions of white at round numbers, but we don't see the same effect for the blue values. Lastly, skipping to the largest values in the dataset: if we select all of these values and expand them, we can see that the fonts alternate perfectly between blue and white.

Moving on, I will discuss one type of numerical artifact. Numerical artifacts relate to the actual values recorded. Duplicate numbers and digits refer to when whole numbers or sequences of digits are repeated more frequently than expected. This can suggest that data may have been copy-pasted or manually entered. Evaluating whether the number of duplicates is more than expected requires understanding the precision of the data and the number of samples. Finding a few duplicates in a large dataset is not suspicious, but finding many duplicates in a small dataset may warrant a closer look.

This dataset represents the amount of time in seconds it takes for a spider to reemerge from its enclosure, a proxy for that spider's boldness. 600 seconds, or 10 minutes, is duplicated many times in this dataset.
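(A minimal sketch, in our own words rather than the tool's code, of listing how often each value repeats in a column so that unusually frequent values stand out. The variable names and toy data are illustrative only.)

```python
from collections import Counter

# Toy stand-in for a column of emergence times, in seconds.
emergence_seconds = [600, 600, 104, 104, 104, 317, 600, 85, 212, 212, 443]

counts = Counter(emergence_seconds)
# Drop a known benign cause of duplication: 600 s is the cap of the observation period.
counts.pop(600, None)

for value, n in counts.most_common():
    if n > 1:
        print(f"{value} appears {n} times")
```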
However, the observation period was capped at 10 minutes, so it's reasonable to ignore that value in our analysis. The issue in this dataset is that there are many values that are duplicated two or three times. Matching values will be highlighted and brought to the top of our list. Here the number 104 shows up three times, which we may consider moderately suspicious given the precision and the size of this dataset.

Structural artifacts relate to the values of spreadsheet cells as well as their positions within the spreadsheet. A region refers to multiple cells with a spatial relationship. This could mean cells that are adjacent in columns or rows, or even nearby cells with gaps in between them. In some scenarios, a few duplicate numbers may be a weak signal of manipulation, or not suspicious at all. But repeated regions are a much stronger signal that something suspicious is happening. The last example that we looked at actually includes an example of this: here, a complete row with the numbers 85, 180, 104, 228.1, and 151.34 seems to have been copied.

Finally, domain artifacts apply knowledge beyond the single dataset. In a given domain, there may be prior knowledge about what is an expected distribution for single-dimensional data, or expected relationships between multidimensional data. We consider deviations from these expectations to be artifacts.

Let's consider a hypothetical dataset that measures the height and weight of people. It is possible that more sophisticated techniques for generating fabricated data, such as scripts, may not leave behind the earlier artifacts we've discussed. Furthermore, the distribution of each column may also match domain expectations. However, if columns are generated independently, they may exhibit strange relationships, such as no correlation when there should be one, or even one that is opposite from expectations, as seen here, where taller people tend to weigh less. In this case, it isn't difficult to think of ways to generate two columns with appropriate relationships, but this gets more difficult as you introduce more columns.
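(A minimal sketch of the kind of domain-knowledge check just described: comparing the observed correlation between two columns against the direction the domain leads us to expect. The expectation and the toy data below are our own illustration.)

```python
import numpy as np

height_cm = np.array([155, 160, 165, 170, 175, 180, 185, 190])
weight_kg = np.array([72, 70, 69, 66, 64, 62, 60, 58])  # fabricated toy data

r = np.corrcoef(height_cm, weight_kg)[0, 1]
expected_sign = +1  # in most populations, taller people tend to weigh more

if np.sign(r) != expected_sign:
    print(f"Correlation r = {r:.2f} has the opposite sign from what the domain suggests")
```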
Every additional column introduces a new set of relationships that can expose a potential flaw in the data fabrication process. Looking again at the driving dataset, we can look at the interaction of two columns: the odometer readings of cars at one point in time, and the odometer readings of the same cars at a later point in time. Each point corresponds to a single car. A point here would be a car that started with zero miles on its odometer and ended with approximately 25,000 miles, and this would be a car that started at roughly 100,000 miles and ended at 125,000 miles. The points on this line correspond to cars that drove zero miles in this time period. Similarly, the cars on this line drove exactly 50,000 miles. We can notice in this plot that there are zero cars that drove more than 50,000 miles, despite many driving near that boundary. Given that there are many cars with between 45,000 and 50,000 miles driven, we would also expect there to be at least a few with more than 50,000 miles.

That brings us to our last main point. Preventing the manipulation of data will require more than just tools and knowledge; it will require changes to the policies and best practices of the scientific community. One idea we think is worth exploring is for journal editors to review datasets for manipulation, similarly to how they currently check for plagiarism. Individuals can also perform similar checks when they receive data from their collaborators.

This is a difficult problem, and we don't claim to have all the answers. There are potential unintended consequences of this work. For instance, false positives could still lead to misguided accusations. Because of this, we believe tools like ours should be used in the review process, where authors can respond to concerns about their data, and the result is a rejected paper, not an accusation that can threaten an author's career. There is also the possibility that bad actors could use tools like ours to improve their falsification of data. And while some may do this, as we have seen with existing plagiarism tools, such tools still continue to catch many instances of cheating.
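(Referring back to the odometer example above: a quick sketch, assuming synthetic stand-in data rather than the real dataset, of checking whether the difference between two paired readings ever exceeds a suspected cutoff, and how much of the data sits just below it.)

```python
import numpy as np

rng = np.random.default_rng(0)
start = rng.uniform(0, 100_000, size=1000)       # illustrative data only
end = start + rng.uniform(0, 50_000, size=1000)  # differences happen to be capped at 50,000 here

miles_driven = end - start
cutoff = 50_000

over = np.sum(miles_driven > cutoff)
near = np.sum((miles_driven > 0.9 * cutoff) & (miles_driven <= cutoff))
print(f"{over} cars above {cutoff} miles driven, {near} cars within 10% below it")
# Many cars just below a hard cutoff and none at all above it is the pattern worth questioning.
```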
Still, it is important to consider these types of broader impacts. We discuss these two concerns, as well as others, in more detail in our paper. At the end of the day, even after spending so much time looking at these cases of data manipulation, we still believe in the scientific community. But we can and should do more to prevent the manipulation of tabular datasets. We have taken a first step, and we urge you to consider how you can help the community understand fraud, build tools, and implement best practices.