Using Guttenberg’s doctoral thesis to test plagiarism detection systems

Professor Debora Weber-Wulff and her team at HTW Berlin have tested various plagiarism detection systems. Instead of an artificially created test data set, they used Guttenberg’s doctoral thesis. The detection results were, as expected, pretty poor.

PlagAware: Initially 28% on the first 159 pages; however, this included a lot of garbage such as pastebin material. After we removed this and the GuttenPlag links, the amount went to 68% before the report disappeared completely. We have not been able to resubmit; it breaks off with an error.
iThenticate: 40%
Ephorus: 5%! Only 10 possible sources found; of these, 3 were GuttenPlag and one was a duplicate.
PlagScan: 15.9%
Urkund: 21%

The results of the experiment are published in iX 6/11. Here you can find a summary.

So far, plagiarism detection systems rely solely on text analysis. As study results show, such text-based systems struggle to identify paraphrased plagiarism, idea plagiarism and translation plagiarism.

In our paper “Comparative Evaluation of Text- and Citation-based Plagiarism Detection Approaches using GuttenPlag”, we have evaluated whether analyzing the citations of a document could help to increase detection rates.
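To give a rough idea of what comparing citations instead of words can look like, here is a minimal Python sketch. It assumes each document has already been reduced to the ordered list of sources it cites, and it measures similarity as the longest common subsequence of those lists. The function name and the citation keys are hypothetical, and this measure is only an illustrative assumption, not necessarily one of the algorithms evaluated in the paper.

```python
from typing import List, Sequence


def longest_common_citation_sequence(a: Sequence[str], b: Sequence[str]) -> List[str]:
    """Return one longest common subsequence of two ordered citation lists.

    Hypothetical helper for illustration only: the order in which sources
    are cited can survive paraphrasing or translation even when the
    wording itself no longer matches.
    """
    rows, cols = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the top-left corner to recover the subsequence.
    shared: List[str] = []
    i = j = 0
    while i < rows and j < cols:
        if a[i] == b[j]:
            shared.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return shared


# Made-up citation keys, in the order they appear in each document.
suspicious = ["Smith2001", "Lee1999", "Kim2005", "Diaz2003"]
candidate_source = ["Smith2001", "Lee1999", "Own2010", "Kim2005"]

shared = longest_common_citation_sequence(suspicious, candidate_source)
print(shared, len(shared) / len(suspicious))
```

A long shared citation sequence between two documents whose text barely overlaps would be a reason for a human to take a closer look; on its own it proves nothing.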

A preprint of our paper (to be published on June 13th at the JCDL’11 conference in Ottawa), in which we evaluate the potential of citation-based plagiarism detection systems using Guttenberg’s doctoral thesis, can be found here.

The abstract:

Various approaches for plagiarism detection exist. All are based on more or less sophisticated text analysis methods such as string matching, fingerprinting or style comparison. In this paper a new approach called Citation-based Plagiarism Detection is evaluated using a doctoral thesis, in which a volunteer crowd-sourcing project called GuttenPlag identified substantial amounts of plagiarism through careful manual inspection. This new approach is able to identify similar and plagiarized documents based on the citations used in the text. It is shown that citation-based plagiarism detection performs significantly better than text-based procedures in identifying strong paraphrasing, translation and some idea plagiarism. Detection rates can be improved by combining citation-based with text-based plagiarism detection.
  • B. Gipp, N. Meuschke, and J. Beel, "Comparative Evaluation of Text- and Citation-based Plagiarism Detection Approaches using GuttenPlag," in Proceedings of the 11th ACM/IEEE Joint Conference on Digital Libraries (JCDL’11), Ottawa, Canada, 2011.
    [Bibtex]
    @INPROCEEDINGS{Gipp11b,
      author = {Bela Gipp and Norman Meuschke and Joeran Beel},
      title = {Comparative Evaluation of Text- and Citation-based Plagiarism Detection
      Approaches using GuttenPlag},
      booktitle = {{P}roceedings of the 11th {ACM}/{IEEE} {J}oint {C}onference on {D}igital
      {L}ibraries ({JCDL}`11)},
      year = {2011},
      address = {Ottawa, Canada},
      month = {June},
      abstract = {Various approaches for plagiarism detection exist. All are based on
      more or less sophisticated text analysis methods such as string matching,
      fingerprinting or style comparison. In this paper a new approach
      called Citation-based Plagiarism Detection is evaluated using a doctoral
      thesis, in which a volunteer crowd-sourcing project called GuttenPlag
      identified substantial amounts of plagiarism through careful manual
      inspection. This new approach is able to identify similar and plagiarized
      documents based on the citations used in the text. It is shown that
      citation-based plagiarism detection performs significantly better
      than text-based procedures in identifying strong paraphrasing, translation
      and some idea plagiarism. Detection rates can be improved by combining
      citation-based with text-based plagiarism detection.}
    }

 

The algorithms developed for citation analysis and pattern matching will be presented at the DocEng conference in Mountain View in September. I’ll upload a preprint soon.
