We host the Corpus Redundancy Manager application so that it can be run on our online workstations, either on Windows online over Linux online with Wine or directly.


Quick description of Corpus Redundancy Manager:

Redundancy caused by copy-and-paste operations in text biases machine learning models for NLP.
This module takes a directory and returns a list containing a subset of the files in that directory, chosen so that the similarity between any two files in the subset stays below an upper bound.

Features:
  • Identify copy-paste redundancy in a document corpus
  • Input: a folder of text documents and a similarity threshold
  • Output (a): a list of non-redundant documents (a non-redundant subset of the corpus)
  • Output (b): a list of document pairs found to be redundant, with the amount of redundancy for each pair
  • Python script (2.6), tested on various Linux flavours and Windows XP/7
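
The input and the two outputs listed above can be illustrated with a minimal sketch. It is not the tool's own code: the function names (`similarity`, `select_non_redundant`, `read_folder`) are hypothetical, the similarity measure is a stand-in based on Python's standard `difflib.SequenceMatcher`, the greedy selection order is an assumption, and the sketch targets Python 3 rather than the 2.6 script described here.

```python
import os
from difflib import SequenceMatcher


def similarity(text_a, text_b):
    """Stand-in similarity in [0, 1]; an assumption, not the tool's own measure."""
    return SequenceMatcher(None, text_a, text_b, autojunk=False).ratio()


def select_non_redundant(docs, threshold):
    """Greedily keep documents that stay at or below the similarity threshold.

    docs: dict mapping a document name to its text.
    Returns (kept, redundant_pairs):
      kept            -- names of the non-redundant subset, output (a)
      redundant_pairs -- (kept_name, rejected_name, score) tuples, output (b)
    """
    kept = []
    redundant_pairs = []
    for name in sorted(docs):
        # Compare the candidate against every document kept so far.
        scores = [(k, name, similarity(docs[k], docs[name])) for k in kept]
        clashes = [s for s in scores if s[2] > threshold]
        if clashes:
            redundant_pairs.extend(clashes)
        else:
            kept.append(name)
    return kept, redundant_pairs


def read_folder(folder):
    """Read every regular file in the folder as text (hypothetical helper)."""
    docs = {}
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            with open(path, encoding="utf-8", errors="replace") as handle:
                docs[name] = handle.read()
    return docs
```

With this sketch, `select_non_redundant(read_folder("corpus"), 0.8)` would return the kept file names and the redundant pairs for a hypothetical `corpus` folder. Because every candidate is compared against every kept document, the cost grows quadratically with corpus size, so a large corpus would need a cheaper similarity measure.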


Audience: Science/Research.
User interface: Console/Terminal.
Programming Language: Python.


©2024. Winfy. All Rights Reserved.

By OD Group OU – Registry code: 1609791 – VAT number: EE102345621.