We have hosted the application nemo curator in order to run this application in our online workstations with Wine or directly.
Quick description about nemo curator:
NeMo Curator is a Python library specifically designed for fast and scalable dataset preparation and curation for large language model (LLM) use-cases such as foundation model pretraining, domain-adaptive pretraining (DAPT), supervised fine-tuning (SFT) and paramter-efficient fine-tuning (PEFT). It greatly accelerates data curation by leveraging GPUs with Dask and RAPIDS, resulting in significant time savings. The library provides a customizable and modular interface, simplifying pipeline expansion and accelerating model convergence through the preparation of high-quality tokens. At the core of the NeMo Curator is the DocumentDataset which serves as the the main dataset class. It acts as a straightforward wrapper around a Dask DataFrame. The Python library offers easy-to-use methods for expanding the functionality of your curation pipeline while eliminating scalability concerns.Features:
- Data download and text extraction
- Language identification and separation with fastText and pycld2
- Text reformatting and cleaning to fix unicode decoding errors via ftfy
- Document-level deduplication
- Multilingual heuristic-based filtering
- Distributed data classification
Programming Language: Python.
Categories:
©2024. Winfy. All Rights Reserved.
By OD Group OU – Registry code: 1609791 -VAT number: EE102345621.