We host the corpus redundancy manager application so that you can run it on our online workstations, either with Wine or directly.
Quick description of corpus redundancy manager:
Redundancy caused by cut-and-paste operations in text creates bias in machine learning for NLP. This module takes a directory and produces a list containing a subset of the files in that directory, with an upper bound on the similarity between any two files in the subset.
Features:
- Identify copy-paste redundancy in a document corpus
- Input: a folder with text documents and a similarity threshold
- Output (a): a list of non-redundant documents (a non-redundant subset of the corpus)
- Output (b): a list of document pairs found to be redundant, with the amount of redundancy for each pair
- Python script (2.6) - tested on various Linux flavours + Windows XP/7
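The input and outputs listed above can be illustrated with a minimal sketch. It is not the module's actual code or algorithm: it assumes Jaccard similarity over word trigrams as the similarity measure and a greedy pass to pick the subset, and it is written for Python 3 rather than 2.6. The names `shingles`, `jaccard`, `find_redundancy` and `load_directory` are hypothetical.

```python
import itertools
import os


def shingles(text, n=3):
    """Return the set of word n-grams in text (assumed similarity unit)."""
    words = text.split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_redundancy(docs, threshold):
    """docs maps a file name to its text.

    Returns (kept, pairs):
      kept  - names of a subset in which no two files exceed the threshold
      pairs - (name_a, name_b, score) for every pair above the threshold
    """
    sets = {name: shingles(text) for name, text in docs.items()}
    names = sorted(sets)

    # Output (b): every pair whose similarity exceeds the threshold.
    pairs = []
    for a, b in itertools.combinations(names, 2):
        score = jaccard(sets[a], sets[b])
        if score > threshold:
            pairs.append((a, b, score))

    # Output (a): greedily keep a file only if it is not too similar
    # to any file that has already been kept.
    kept = []
    for name in names:
        if all(jaccard(sets[name], sets[k]) <= threshold for k in kept):
            kept.append(name)
    return kept, pairs


def load_directory(path):
    """Read every regular file in path into a name -> text dict."""
    docs = {}
    for entry in sorted(os.listdir(path)):
        full = os.path.join(path, entry)
        if os.path.isfile(full):
            with open(full, encoding="utf-8", errors="replace") as handle:
                docs[entry] = handle.read()
    return docs
```

Under these assumptions, `find_redundancy(load_directory("corpus"), 0.5)` would return the retained file names and the redundant pairs for a folder named `corpus`. The pairwise comparison is quadratic in the number of files, so this sketch only suits small corpora.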
Audience: Science/Research.
User interface: Console/Terminal.
Programming Language: Python.
©2024. Winfy. All Rights Reserved.
By OD Group OU – Registry code: 1609791 – VAT number: EE102345621.