When working with static sites, documentation portals, or CMS-powered websites, it's often necessary to generate a complete offline copy of the site, whether for archiving, backup, or redeployment. In this guide, we'll walk through building a custom tool in .NET that does just that: it reads a sitemap.xml, crawls all the pages, recursively discovers nested URLs, and downloads associated resources like images, JavaScript files, CSS, and even background images embedded in style attributes.
📦 What This Tool Does
This .NET console app performs the following:
Loads all URLs from a sitemap.xml
Appends custom URLs (like /404, /robots.txt, etc.)
Recursively crawls each page for internal links
Downloads all discovered pages and resources
Checks file sizes to avoid redundant downloads
Preserves folder structure locally
🏗️ Project Structure and Key Components
1. Sitemap Loading and Manual URLs
The app begins by loading all URLs defined in the sitemap and then appends any manually defined URLs to ensure completeness (e.g. error pages, robots files).
var manualPaths = new[] { "/sitemap.xml", "/404", "/robots.txt", "/humans.txt", "/manifest.json" };
These paths are resolved relative to your base domain and fed into the crawling queue.
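As a minimal sketch, sitemap loading and manual-path appending might look like the following; the method name LoadSitemapUrlsAsync and the baseUri parameter are illustrative, not taken from the repo:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using System.Xml.Linq;

static async Task<List<Uri>> LoadSitemapUrlsAsync(HttpClient http, Uri baseUri)
{
    // Fetch and parse the sitemap; <urlset><url><loc> is the standard layout.
    var xml = await http.GetStringAsync(new Uri(baseUri, "/sitemap.xml"));
    var doc = XDocument.Parse(xml);
    XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";

    var urls = doc.Descendants(ns + "loc")
                  .Select(loc => new Uri(loc.Value.Trim()))
                  .ToList();

    // Append the manually defined paths so error pages and meta files are covered too.
    var manualPaths = new[] { "/sitemap.xml", "/404", "/robots.txt", "/humans.txt", "/manifest.json" };
    urls.AddRange(manualPaths.Select(p => new Uri(baseUri, p)));

    return urls.Distinct().ToList();
}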
2. Recursive Crawling Logic
Using a queue and a hash set to manage visited URLs, the tool downloads each page and extracts links from <a> tags, ensuring deep structures like:
/categories/backend-development/page/2
/articles/page/3
...are discovered and processed.
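A stripped-down version of that crawl loop could look like this; CrawlAsync and the commented-out SavePage call are placeholder names, error handling is omitted, and HtmlAgilityPack is assumed to be referenced:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

static async Task CrawlAsync(HttpClient http, Uri baseUri, IEnumerable<Uri> seeds)
{
    var queue = new Queue<Uri>(seeds);
    var visited = new HashSet<string>();

    while (queue.Count > 0)
    {
        var current = queue.Dequeue();
        if (!visited.Add(current.AbsoluteUri)) continue;   // already processed

        var html = await http.GetStringAsync(current);
        // SavePage(current, html);                        // persist using the local folder mapping

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // Enqueue every internal <a href="..."> so nested pages like /articles/page/3 are found.
        var anchors = htmlDoc.DocumentNode.SelectNodes("//a[@href]") ?? Enumerable.Empty<HtmlNode>();
        foreach (var a in anchors)
        {
            var href = a.GetAttributeValue("href", "");
            if (Uri.TryCreate(current, href, out var link) && link.Host == baseUri.Host)
                queue.Enqueue(link);
        }
    }
}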
3. Asset Extraction
Assets are extracted from:
<link rel="stylesheet">
<script src="">
<img src="">
style="background-image: url(...)"
inline styles
A regular expression is used to parse url() references inside style attributes. All resolved resources are downloaded only if:
They do not exist locally or
Their file size differs from the remote version
This ensures efficiency without re-downloading unchanged content.
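Here is a rough sketch of both pieces, the url() regex and the size check; it assumes a HEAD request is used to read Content-Length, and the helper names are illustrative rather than the ones in the repo:

using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Matches url('...'), url("...") and url(...) in style attributes and CSS.
static IEnumerable<string> ExtractCssUrls(string style)
{
    var cssUrl = new Regex(@"url\(\s*['""]?(?<url>[^'"")]+)['""]?\s*\)", RegexOptions.IgnoreCase);
    foreach (Match m in cssUrl.Matches(style))
        yield return m.Groups["url"].Value;
}

static async Task DownloadIfChangedAsync(HttpClient http, Uri resource, string localPath)
{
    // HEAD request to read Content-Length without pulling the whole file down.
    using var head = new HttpRequestMessage(HttpMethod.Head, resource);
    using var response = await http.SendAsync(head);
    var remoteSize = response.Content.Headers.ContentLength;

    // Skip the download when a local copy exists and the sizes match.
    if (File.Exists(localPath) && remoteSize.HasValue &&
        new FileInfo(localPath).Length == remoteSize.Value)
        return;

    Directory.CreateDirectory(Path.GetDirectoryName(localPath)!);
    await File.WriteAllBytesAsync(localPath, await http.GetByteArrayAsync(resource));
}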
🛠️ Tech Stack
.NET 6+
HttpClient for requests
HtmlAgilityPack for HTML parsing
System.Text.RegularExpressions for CSS url() extraction
XDocument for XML parsing
📁 Output Structure
The tool replicates the site structure in a local folder such as:
ExportedSite/
└── example.com/
├── index.html
├── articles/
│ └── page/
│ └── 2/
│ └── index.html
├── media/
└── manifest.json
All relative references remain untouched, assuming the site's original directory structure will be preserved when served offline or uploaded elsewhere.
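As an illustration of that mapping, a minimal helper might look like the sketch below; the ToLocalPath name, the ExportedSite default, and the index.html convention for extension-less URLs are assumptions, not repo specifics:

using System;
using System.IO;

static string ToLocalPath(Uri url, string outputRoot = "ExportedSite")
{
    var relative = url.AbsolutePath.Trim('/');

    // Directory-style URLs (no file extension) become folder/index.html,
    // so /articles/page/2 maps to articles/page/2/index.html.
    if (relative.Length == 0)
        relative = "index.html";
    else if (!Path.HasExtension(relative))
        relative = relative + "/index.html";

    return Path.Combine(outputRoot, url.Host, relative.Replace('/', Path.DirectorySeparatorChar));
}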
✨ Use Cases
Backing up your static site or headless CMS
Moving content from one host to another
Testing site changes locally
Creating offline versions for distribution (e.g. kiosks or documentation bundles)
🧪 Future Improvements
You can expand this tool by:
Adding CLI arguments for flexibility
Rewriting URLs inside HTML to point to local paths
Filtering crawl depth or matching specific URL patterns
Saving a summary or log of downloaded files
🚀 Conclusion
This small but powerful .NET app provides a tailored alternative to tools like wget or httrack, especially when you need full control over how and what gets downloaded from your site. With just a few lines of code, you can export your web presence exactly the way you want.
Let me know if you’d like to turn this into a public tool or integrate it with your CI/CD workflow.
GitHub Repo: https://github.com/sekmenhuseyin/Sekmen.StaticSiteGenerator