Robots & Sitemap Files
Robots.txt and Sitemap files are standard web resources used to guide search engine crawlers. While they are intended for legitimate purposes, they can also reveal sensitive information that attackers may exploit during reconnaissance.
1. Robots.txt
Purpose:
- Located at: https://example.com/robots.txt
- Instructs web crawlers which parts of a site they should not crawl (a disallowed path can still end up in a search index if other pages link to it).
- Uses a simple User-agent and Disallow/Allow directive format.
Example:
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Security Perspective:
- Although it is not a security feature, disallowed paths often point to sensitive areas (admin panels, staging environments, test APIs).
- Attackers may use it to discover hidden directories; the sketch below automates this.
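As a quick way to see what a target's robots.txt gives away, the sketch below fetches the file and lists its Disallow paths. It is a minimal example using only Python's standard library; the base URL is a placeholder, so substitute a host you are authorized to test.
Example (Python):
import urllib.request

def fetch_disallowed_paths(base_url):
    """Return the Disallow paths declared in a site's robots.txt."""
    with urllib.request.urlopen(f"{base_url}/robots.txt", timeout=10) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    paths = []
    for line in body.splitlines():
        # Drop inline comments, then match the Disallow directive.
        line = line.split("#", 1)[0].strip()
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:  # an empty Disallow means "allow everything"
                paths.append(path)
    return paths

# Placeholder target; point this at a host you have permission to test.
print(fetch_disallowed_paths("https://example.com"))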
2. Sitemap Files
Purpose:
- Usually located at: https://example.com/sitemap.xml
- Lists the important URLs a site wants search engines to crawl and index.
- Can consist of multiple chained sitemaps and is often auto-generated by a CMS.
Example (XML format):
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2025-08-10</lastmod>
  </url>
  <url>
    <loc>https://example.com/admin/login</loc>
  </url>
</urlset>
Security Perspective:
- Can expose non-public URLs such as beta pages, admin portals, or development endpoints.
- Multiple sitemaps might reference hidden areas or separate subdomains (the sketch below extracts every listed URL).
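The sketch below pulls every <loc> entry out of a single urlset sitemap, which is usually the first pass during recon. It is a minimal example assuming the sitemaps.org 0.9 namespace shown above; the sitemap URL is a placeholder.
Example (Python):
import urllib.request
import xml.etree.ElementTree as ET

# Namespace from the sitemaps.org 0.9 schema used in the example above.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(sitemap_url):
    """Return every <loc> value found in the sitemap."""
    with urllib.request.urlopen(sitemap_url, timeout=10) as resp:
        root = ET.fromstring(resp.read())
    return [loc.text.strip() for loc in root.findall(".//sm:loc", NS)]

# Placeholder URL; swap in the target's real sitemap.
for url in sitemap_urls("https://example.com/sitemap.xml"):
    print(url)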
Cybersecurity & Pentesting Uses
Why attackers and pentesters check them:
- Identify hidden directories and forgotten pages (the sketch below checks which discovered paths actually respond).
- Find API endpoints or staging environments not meant for public access.
- Cross-reference findings with Google Dorks for deeper discovery.
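To verify which discovered paths exist, a common follow-up is to send lightweight HEAD requests and note the status codes. The sketch below is illustrative only: the path list is a stand-in for whatever robots.txt or the sitemap revealed, and even a 403 confirms the path is there. Only probe hosts you are authorized to assess.
Example (Python):
import urllib.request
import urllib.error

def probe(base_url, paths):
    """HEAD each path and report its HTTP status."""
    for path in paths:
        req = urllib.request.Request(base_url + path, method="HEAD")
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                print(resp.status, path)
        except urllib.error.HTTPError as err:
            # A 403 here is still a finding: the path exists but is protected.
            print(err.code, path)
        except urllib.error.URLError:
            print("no response", path)

# Placeholder paths; feed in whatever robots.txt or the sitemap revealed.
probe("https://example.com", ["/admin/", "/private/", "/staging/"])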
Pentesting Tips:
- Always check /robots.txt and /sitemap.xml early in recon.
- If a sitemap links to another XML file, follow it; large sites often chain them (the sketch below walks such a chain).
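Following chained sitemaps can be automated. The sketch below walks a <sitemapindex> recursively, descending into each child sitemap until it reaches page URLs. It assumes the sitemaps.org 0.9 namespace; the entry URL is a placeholder.
Example (Python):
import urllib.request
import xml.etree.ElementTree as ET

# ElementTree qualifies tags as "{namespace}name".
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def walk_sitemap(url, seen=None):
    """Yield page URLs, descending into chained <sitemapindex> files."""
    seen = set() if seen is None else seen
    if url in seen:  # guard against cycles between sitemaps
        return
    seen.add(url)
    with urllib.request.urlopen(url, timeout=10) as resp:
        root = ET.fromstring(resp.read())
    if root.tag == NS + "sitemapindex":
        # Each <loc> here names another XML file: follow the chain.
        for loc in root.iter(NS + "loc"):
            yield from walk_sitemap(loc.text.strip(), seen)
    else:
        for loc in root.iter(NS + "loc"):
            yield loc.text.strip()

# Placeholder entry point; large sites often start with a sitemap index.
for page in walk_sitemap("https://example.com/sitemap.xml"):
    print(page)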
Best Practices for Security
- Do not list sensitive areas in robots.txt; secure them with authentication instead.
- Keep sitemap entries limited to public, production-ready pages.
- Use staging or dev environments with restricted access.
- Regularly review sitemap generators for accidental URL inclusion.