Building a .NET Site Exporter Tool Using Sitemap and Recursive Crawling

A practical way to save full websites including assets

Posted by Hüseyin Sekmenoğlu on February 02, 2025 · Backend Development

When working with static sites, documentation portals, or CMS-powered websites, it's often necessary to generate a complete offline copy of the site for archiving, backup, or redeployment. In this guide, we'll walk through building a custom tool in .NET that does just that: it reads a sitemap.xml, crawls all the pages, recursively discovers nested URLs, and downloads associated resources such as images, JavaScript files, CSS, and even background images embedded in style attributes.


📦 What This Tool Does

This .NET console app performs the following:

  • Loads all URLs from a sitemap.xml

  • Appends custom URLs (like /404, /robots.txt, etc.)

  • Recursively crawls each page for internal links

  • Downloads all discovered pages and resources

  • Checks file sizes to avoid redundant downloads

  • Preserves folder structure locally


🏗️ Project Structure and Key Components

1. Sitemap Loading and Manual URLs

The app begins by loading all URLs defined in the sitemap and then appends any manually defined URLs to ensure completeness (e.g. error pages, robots files).

var manualPaths = new[] { "/sitemap.xml", "/404", "/robots.txt", "/humans.txt", "/manifest.json" };

These paths are resolved relative to your base domain and fed into the crawling queue.
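A minimal sketch of this step might look like the following, assuming a .NET 6+ console app with top-level statements and implicit usings; baseUri, http and urlQueue are names introduced here purely for illustration.

using System.Xml.Linq;

var baseUri = new Uri("https://example.com");
using var http = new HttpClient();

// Pull every <loc> entry out of sitemap.xml.
var sitemapXml = await http.GetStringAsync(new Uri(baseUri, "/sitemap.xml"));
XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";
var urls = XDocument.Parse(sitemapXml)
    .Descendants(ns + "loc")
    .Select(loc => loc.Value.Trim())
    .ToList();

// Append the manually defined paths, resolved against the base domain.
var manualPaths = new[] { "/sitemap.xml", "/404", "/robots.txt", "/humans.txt", "/manifest.json" };
urls.AddRange(manualPaths.Select(p => new Uri(baseUri, p).ToString()));

// Everything ends up in one queue that feeds the crawler.
var urlQueue = new Queue<string>(urls.Distinct());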

2. Recursive Crawling Logic

Using a queue for pending URLs and a hash set for already-visited ones, the tool downloads each page and extracts links from <a> tags, ensuring that deeply nested paths like:

/categories/backend-development/page/2
/articles/page/3

...are discovered and processed.
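Here is a sketch of the crawl loop, continuing from the snippet above. It assumes HtmlAgilityPack is referenced as a NuGet package, and ResolveLocalPath is a hypothetical URL-to-path helper sketched later under the output structure.

using HtmlAgilityPack;

var visited = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

while (urlQueue.Count > 0)
{
    var url = urlQueue.Dequeue();
    if (!visited.Add(url)) continue; // already processed

    string html;
    try { html = await http.GetStringAsync(url); }
    catch (HttpRequestException) { continue; } // e.g. the intentional /404 entry

    // ResolveLocalPath is a hypothetical URL-to-path mapper (sketched further below).
    var localPath = ResolveLocalPath(new Uri(url));
    Directory.CreateDirectory(Path.GetDirectoryName(localPath)!);
    await File.WriteAllTextAsync(localPath, html);

    var page = new HtmlDocument();
    page.LoadHtml(html);

    // Queue internal <a href> links so nested pages like /articles/page/3 are found too.
    foreach (var a in page.DocumentNode.SelectNodes("//a[@href]") ?? Enumerable.Empty<HtmlNode>())
    {
        var href = a.GetAttributeValue("href", "");
        if (Uri.TryCreate(baseUri, href, out var link) && link.Host == baseUri.Host)
            urlQueue.Enqueue(link.GetLeftPart(UriPartial.Path)); // drop query strings and #fragments
    }
}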

3. Asset Extraction

Assets are extracted from:

  • <link rel="stylesheet">

  • <script src="">

  • <img src="">

  • style="background-image: url(...)" inline styles

A regular expression is used to parse url() references inside style attributes. All resolved resources are downloaded only if:

  • They do not exist locally or

  • Their file size differs from the remote version

This ensures efficiency without re-downloading unchanged content.
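As a sketch of that check, assuming the page document and http client from the crawl loop above (DownloadIfChangedAsync and ResolveLocalPath are illustrative names, not a fixed API), the inline-style case could look like this:

using System.Text.RegularExpressions;

// Matches url(...) with optional quotes, e.g. style="background-image: url('/media/bg.png')".
var cssUrlPattern = new Regex(@"url\(\s*['""]?(?<u>[^'"")]+)['""]?\s*\)", RegexOptions.IgnoreCase);

foreach (var node in page.DocumentNode.SelectNodes("//*[@style]") ?? Enumerable.Empty<HtmlNode>())
{
    foreach (Match m in cssUrlPattern.Matches(node.GetAttributeValue("style", "")))
    {
        if (Uri.TryCreate(baseUri, m.Groups["u"].Value, out var assetUri))
            await DownloadIfChangedAsync(assetUri);
    }
}

async Task DownloadIfChangedAsync(Uri assetUri)
{
    var localPath = ResolveLocalPath(assetUri);

    if (File.Exists(localPath))
    {
        // Compare the local size against the remote Content-Length before re-downloading.
        using var head = new HttpRequestMessage(HttpMethod.Head, assetUri);
        using var response = await http.SendAsync(head);
        var remoteLength = response.Content.Headers.ContentLength;
        if (remoteLength.HasValue && remoteLength.Value == new FileInfo(localPath).Length)
            return;
    }

    Directory.CreateDirectory(Path.GetDirectoryName(localPath)!);
    await File.WriteAllBytesAsync(localPath, await http.GetByteArrayAsync(assetUri));
}

The other asset types (<link>, <script>, <img>) can be handled the same way: select their href or src attributes with XPath and feed the resolved URIs into the same download check.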


🛠️ Tech Stack

  • .NET 6+

  • HttpClient for requests

  • HtmlAgilityPack for HTML parsing

  • System.Text.RegularExpressions for CSS url() extraction

  • XDocument for XML parsing


📁 Output Structure

The tool replicates the site structure in a local folder such as:

ExportedSite/
  └── example.com/
      ├── index.html
      ├── articles/
      │   └── page/
      │       └── 2/
      │           └── index.html
      ├── media/
      └── manifest.json

All relative references remain untouched, assuming the site's original directory structure will be preserved when served offline or uploaded elsewhere.
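One possible shape for the hypothetical ResolveLocalPath helper referenced in the sketches above, producing exactly this layout: extensionless page URLs become <path>/index.html so the exported tree mirrors the live site.

string ResolveLocalPath(Uri uri)
{
    var relative = uri.AbsolutePath.Trim('/');               // e.g. "articles/page/2"
    if (relative.Length == 0)
        relative = "index.html";                             // the site root
    else if (!Path.HasExtension(relative))
        relative = Path.Combine(relative, "index.html");     // folder-style page URL

    return Path.Combine("ExportedSite", uri.Host,
        relative.Replace('/', Path.DirectorySeparatorChar));
}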


✨ Use Cases

  • Backing up your static site or headless CMS

  • Moving content from one host to another

  • Testing site changes locally

  • Creating offline versions for distribution (e.g. kiosks or documentation bundles)


🧪 Future Improvements

You can expand this tool by:

  • Adding CLI arguments for flexibility

  • Rewriting URLs inside HTML to point to local paths

  • Filtering crawl depth or matching specific URL patterns

  • Saving a summary or log of downloaded files


🚀 Conclusion

This small but powerful .NET app provides a tailored alternative to tools like wget or HTTrack, especially when you need full control over how and what gets downloaded from your site. With just a few lines of code, you can export your web presence exactly the way you want.

Let me know if you’d like to turn this into a public tool or integrate it with your CI/CD workflow.

GitHub repo: https://github.com/sekmenhuseyin/Sekmen.StaticSiteGenerator