Building a .NET Site Exporter Tool Using Sitemap and Recursive Crawling

A practical way to save full websites including assets

Posted by Hüseyin Sekmenoğlu on February 02, 2025 · Backend Development

When working with static sites, documentation portals, or CMS-powered websites, it's often necessary to generate a complete offline copy of the site for archiving, backup, or redeployment. In this guide, we'll walk through building a custom tool in .NET that does just that: it reads a sitemap.xml, crawls all the pages, recursively discovers nested URLs, and downloads associated resources such as images, JavaScript files, CSS, and even background images embedded in style attributes.


📦 What This Tool Does

This .NET console app performs the following:

  • Loads all URLs from a sitemap.xml

  • Appends custom URLs (like /404, /robots.txt, etc.)

  • Recursively crawls each page for internal links

  • Downloads all discovered pages and resources

  • Checks file sizes to avoid redundant downloads

  • Preserves folder structure locally


🏗️ Project Structure and Key Components

1. Sitemap Loading and Manual URLs

The app begins by loading all URLs defined in the sitemap and then appends any manually defined URLs to ensure completeness (e.g. error pages, robots files).

var manualPaths = new[] { "/sitemap.xml", "/404", "/robots.txt", "/humans.txt", "/manifest.json" };

These paths are resolved relative to your base domain and fed into the crawling queue.
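A minimal sketch of this step might look like the following, assuming a .NET 6+ console app with top-level statements and implicit usings; baseUri, http and urlQueue are names introduced here purely for illustration.

using System.Xml.Linq;

var baseUri = new Uri("https://example.com");
using var http = new HttpClient();

// Pull every <loc> entry out of sitemap.xml.
var sitemapXml = await http.GetStringAsync(new Uri(baseUri, "/sitemap.xml"));
XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";
var urls = XDocument.Parse(sitemapXml)
    .Descendants(ns + "loc")
    .Select(loc => loc.Value.Trim())
    .ToList();

// Append the manually defined paths, resolved against the base domain.
var manualPaths = new[] { "/sitemap.xml", "/404", "/robots.txt", "/humans.txt", "/manifest.json" };
urls.AddRange(manualPaths.Select(p => new Uri(baseUri, p).ToString()));

// Everything ends up in one queue that feeds the crawler.
var urlQueue = new Queue<string>(urls.Distinct());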

2. Recursive Crawling Logic

Using a queue for pending URLs and a hash set for already-visited ones, the tool downloads each page and extracts links from <a> tags, ensuring that deeply nested paths like:

/categories/backend-development/page/2
/articles/page/3

...are discovered and processed.
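Here is a sketch of the crawl loop, continuing from the snippet above. It assumes HtmlAgilityPack is referenced as a NuGet package, and ResolveLocalPath is a hypothetical URL-to-path helper sketched later under the output structure.

using HtmlAgilityPack;

var visited = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

while (urlQueue.Count > 0)
{
    var url = urlQueue.Dequeue();
    if (!visited.Add(url)) continue; // already processed

    string html;
    try { html = await http.GetStringAsync(url); }
    catch (HttpRequestException) { continue; } // e.g. the intentional /404 entry

    // ResolveLocalPath is a hypothetical URL-to-path mapper (sketched further below).
    var localPath = ResolveLocalPath(new Uri(url));
    Directory.CreateDirectory(Path.GetDirectoryName(localPath)!);
    await File.WriteAllTextAsync(localPath, html);

    var page = new HtmlDocument();
    page.LoadHtml(html);

    // Queue internal <a href> links so nested pages like /articles/page/3 are found too.
    foreach (var a in page.DocumentNode.SelectNodes("//a[@href]") ?? Enumerable.Empty<HtmlNode>())
    {
        var href = a.GetAttributeValue("href", "");
        if (Uri.TryCreate(baseUri, href, out var link) && link.Host == baseUri.Host)
            urlQueue.Enqueue(link.GetLeftPart(UriPartial.Path)); // drop query strings and #fragments
    }
}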

3. Asset Extraction

Assets are extracted from:

  • <link rel="stylesheet">

  • <script src="">

  • <img src="">

  • style="background-image: url(...)" inline styles

A regular expression is used to parse url() references inside style attributes. All resolved resources are downloaded only if:

  • They do not exist locally or

  • Their file size differs from the remote version

This ensures efficiency without re-downloading unchanged content.
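As a sketch of that check, assuming the page document and http client from the crawl loop above (DownloadIfChangedAsync and ResolveLocalPath are illustrative names, not a fixed API), the inline-style case could look like this:

using System.Text.RegularExpressions;

// Matches url(...) with optional quotes, e.g. style="background-image: url('/media/bg.png')".
var cssUrlPattern = new Regex(@"url\(\s*['""]?(?<u>[^'"")]+)['""]?\s*\)", RegexOptions.IgnoreCase);

foreach (var node in page.DocumentNode.SelectNodes("//*[@style]") ?? Enumerable.Empty<HtmlNode>())
{
    foreach (Match m in cssUrlPattern.Matches(node.GetAttributeValue("style", "")))
    {
        if (Uri.TryCreate(baseUri, m.Groups["u"].Value, out var assetUri))
            await DownloadIfChangedAsync(assetUri);
    }
}

async Task DownloadIfChangedAsync(Uri assetUri)
{
    var localPath = ResolveLocalPath(assetUri);

    if (File.Exists(localPath))
    {
        // Compare the local size against the remote Content-Length before re-downloading.
        using var head = new HttpRequestMessage(HttpMethod.Head, assetUri);
        using var response = await http.SendAsync(head);
        var remoteLength = response.Content.Headers.ContentLength;
        if (remoteLength.HasValue && remoteLength.Value == new FileInfo(localPath).Length)
            return;
    }

    Directory.CreateDirectory(Path.GetDirectoryName(localPath)!);
    await File.WriteAllBytesAsync(localPath, await http.GetByteArrayAsync(assetUri));
}

The other asset types (<link>, <script>, <img>) can be handled the same way: select their href or src attributes with XPath and feed the resolved URIs into the same download check.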


🛠️ Tech Stack

  • .NET 6+

  • HttpClient for requests

  • HtmlAgilityPack for HTML parsing

  • System.Text.RegularExpressions for CSS url() extraction

  • XDocument for XML parsing


📁 Output Structure

The tool replicates the site structure in a local folder such as:

ExportedSite/
  └── example.com/
      ├── index.html
      ├── articles/
      │   └── page/
      │       └── 2/
      │           └── index.html
      ├── media/
      └── manifest.json

All relative references remain untouched, assuming the site's original directory structure will be preserved when served offline or uploaded elsewhere.
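One possible shape for the hypothetical ResolveLocalPath helper referenced in the sketches above, producing exactly this layout: extensionless page URLs become <path>/index.html so the exported tree mirrors the live site.

string ResolveLocalPath(Uri uri)
{
    var relative = uri.AbsolutePath.Trim('/');               // e.g. "articles/page/2"
    if (relative.Length == 0)
        relative = "index.html";                             // the site root
    else if (!Path.HasExtension(relative))
        relative = Path.Combine(relative, "index.html");     // folder-style page URL

    return Path.Combine("ExportedSite", uri.Host,
        relative.Replace('/', Path.DirectorySeparatorChar));
}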


✨ Use Cases

  • Backing up your static site or headless CMS

  • Moving content from one host to another

  • Testing site changes locally

  • Creating offline versions for distribution (e.g. kiosks or documentation bundles)


🧪 Future Improvements

You can expand this tool by:

  • Adding CLI arguments for flexibility

  • Rewriting URLs inside HTML to point to local paths

  • Filtering crawl depth or matching specific URL patterns

  • Saving a summary or log of downloaded files


🚀 Conclusion

This small but powerful .NET app provides a tailored alternative to tools like wget or HTTrack, especially when you need full control over how and what gets downloaded from your site. With just a few lines of code, you can export your web presence exactly the way you want.

Let me know if you’d like to turn this into a public tool or integrate it with your CI/CD workflow.

GitHub repo: https://github.com/sekmenhuseyin/Sekmen.StaticSiteGenerator