We host grab-site so that you can run it on our online workstations, either directly or with Wine.


Quick description of grab-site:

grab-site is an open source web crawling tool designed to archive and back up websites by recursively downloading their content. It works by taking a starting URL and systematically following links across the site, capturing pages and resources and saving them into WARC archive files for long-term preservation. Internally, the crawler uses a fork of the wpull engine to fetch and process web pages efficiently during large-scale crawls. grab-site includes a built-in dashboard that displays real-time crawl activity, including which URLs are currently being processed and how many remain in the queue. Users can dynamically apply ignore patterns during an active crawl, allowing them to skip problematic or unnecessary URLs that could slow down or block the archiving process. grab-site also provides predefined ignore sets for common site structures such as forums and other complex web platforms. Additional mechanisms like duplicate page detection help avoid re-crawling identical content.
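The idea of applying ignore patterns while a crawl is running can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not grab-site's own code: the names IgnoreSet, drain, and add_calendar_rule are hypothetical, the URLs are placeholders, and no pages are actually fetched. It shows a queue being drained while a new regular-expression rule is added partway through, so that URLs still waiting in the queue are skipped.

```python
import re
from collections import deque


class IgnoreSet:
    """Hypothetical container of regex ignore patterns that can grow at any time."""

    def __init__(self, patterns=()):
        self._patterns = []
        for pattern in patterns:
            self.add(pattern)

    def add(self, pattern):
        # Compile once so that matching stays cheap for long queues.
        self._patterns.append(re.compile(pattern))

    def matches(self, url):
        return any(p.search(url) for p in self._patterns)


def drain(queue, ignores, on_fetch=None):
    """Pop URLs in order, skipping any that match the current ignore set."""
    fetched = []
    while queue:
        url = queue.popleft()
        if ignores.matches(url):
            continue  # the rule may have been added after this URL was queued
        fetched.append(url)  # a real crawler would download the page here
        if on_fetch is not None:
            on_fetch(url)
    return fetched


ignores = IgnoreSet([r"\?action=reply"])
queue = deque([
    "https://example.com/",
    "https://example.com/forum/thread-1",
    "https://example.com/forum/thread-1?action=reply",
    "https://example.com/calendar/2024/01",
])


def add_calendar_rule(url):
    # Simulates an operator adding a rule while the crawl is in progress.
    if url.endswith("/forum/thread-1"):
        ignores.add(r"/calendar/")


fetched = drain(queue, ignores, on_fetch=add_calendar_rule)
print(fetched)
```

Checking the ignore set at dequeue time, rather than only when a URL is first discovered, is what lets a rule added mid-crawl affect URLs that are already queued.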

Features:
  • Recursive website crawling starting from one or more URLs
  • Saves captured content in WARC archival format
  • Built-in dashboard for monitoring active crawls and URL queues
  • Dynamic ignore patterns that can be edited while crawling
  • Duplicate page detection to avoid reprocessing identical content
  • Disk-based URL queue designed for very large crawl workloads
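As a rough illustration of the duplicate page detection feature listed above, the following Python sketch flags pages whose bodies are byte-for-byte identical by comparing SHA-256 digests. This is an assumed, simplified technique (exact-content hashing) and grab-site's actual detection may work differently; the class name DuplicateDetector and the sample URLs are hypothetical.

```python
import hashlib


class DuplicateDetector:
    """Hypothetical helper that remembers which URL first produced each body."""

    def __init__(self):
        self._first_url_by_digest = {}

    def check(self, url, body):
        """Return the URL that first served this exact body, or None if it is new."""
        digest = hashlib.sha256(body).hexdigest()
        first_url = self._first_url_by_digest.setdefault(digest, url)
        return None if first_url == url else first_url


detector = DuplicateDetector()
pages = [
    ("https://example.com/a", b"<html>same</html>"),
    ("https://example.com/b", b"<html>same</html>"),
    ("https://example.com/c", b"<html>different</html>"),
]
for url, body in pages:
    original = detector.check(url, body)
    if original is not None:
        print(f"{url} duplicates {original}")
```

Exact hashing only catches identical responses; pages that differ by a timestamp or session token would need some normalization before hashing.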


Programming Language: Python, Unix Shell.
Categories: Web Scrapers


©2024. Winfy. All Rights Reserved.

By OD Group OU – Registry code: 1609791 – VAT number: EE102345621.