Abstract

This article introduces an automated method for cross-lingual similarity assessment, applied to plagiarism detection: a suspicious text is compared against an indexed reference corpus. The approach relies on a multilingual sentence encoder to embed the semantics of the sentences extracted from the suspicious document. For each sentence, we then retrieve the most similar sentence in the reference corpus. To address the scalability issue of nearest-neighbor search in high-dimensional vector spaces, we use Faiss, a heuristic search engine based on the Voronoi cells of cluster centroids. Finally, a classifier is trained to identify duplicated content, using as features the highest cosine similarity between the sentence and the reference corpus items, the sentence embeddings, and a few other features. The method is evaluated with several metrics on 328 documents from the real-world database of our media partner. The best approach achieves an F1 score of 89%, a result that should be confirmed on a larger and more representative dataset.
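To illustrate the retrieval step, here is a minimal pure-Python sketch of a nearest-neighbor search restricted to Voronoi cells, scored by cosine similarity. It is not the Faiss API and not the authors' implementation: the function names (`build_cells`, `search`), the `nprobe` parameter name, the random toy vectors standing in for sentence embeddings, and the choice of centroids by random sampling (instead of a proper clustering step) are all assumptions made for the example.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_cells(vectors, centroids):
    """Assign each unit vector to the cell of its most similar centroid."""
    cells = {c: [] for c in range(len(centroids))}
    for idx, v in enumerate(vectors):
        best = max(range(len(centroids)), key=lambda c: dot(v, centroids[c]))
        cells[best].append(idx)
    return cells

def search(query, vectors, centroids, cells, nprobe=1):
    """Return (index, cosine) of the best match found in the nprobe closest cells."""
    q = normalize(query)
    ranked = sorted(range(len(centroids)),
                    key=lambda c: dot(q, centroids[c]), reverse=True)
    best_idx, best_sim = None, -math.inf
    for c in ranked[:nprobe]:          # only these cells are scanned
        for idx in cells[c]:
            sim = dot(q, vectors[idx])
            if sim > best_sim:
                best_idx, best_sim = idx, sim
    return best_idx, best_sim

# Toy data: random unit vectors stand in for sentence embeddings (assumption).
random.seed(0)
dim = 8
corpus = [normalize([random.gauss(0, 1) for _ in range(dim)]) for _ in range(200)]
# Simplification: centroids are sampled from the corpus rather than learned.
centroids = random.sample(corpus, 10)
cells = build_cells(corpus, centroids)

query = [random.gauss(0, 1) for _ in range(dim)]
approx_idx, approx_sim = search(query, corpus, centroids, cells, nprobe=2)
```

Scanning only a few cells trades some recall for speed: with `nprobe` equal to the number of cells the search is exhaustive, while smaller values may miss the true nearest neighbor when it lies in an unvisited cell. In a pipeline like the one described, the returned similarity would be one of the features passed to the classifier.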
