Does the sitemap generator skip paths that are disallowed in robots.txt, in particular wildcard Disallow directives?
I was hoping that leaving the configuration at its defaults would simulate a crawl the way Googlebot would, but I can see the sitemap generator crawling and parsing URLs that are disallowed in robots.txt.
I'm crawling a Magento site with ~700 products in ~100 categories. Without robots.txt exclusions, "noindex" meta tags and nofollow links, this quickly balloons into tens of thousands of URLs, which I want neither indexed (duplicate content) nor crawled by the bots (unnecessary server load).
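To give a rough sense of the scale, here is a quick back-of-the-envelope count in Python; the per-parameter value counts are my own assumptions based on typical Magento sort/pagination options, not measurements from this site:

categories = 100
dir_values = 2      # dir=asc / dir=desc
order_values = 3    # e.g. order=name, order=price, order=position
limit_values = 3    # e.g. limit=9, limit=18, limit=36
pages = 10          # rough average of paginated pages (p=N) per category

print(categories * dir_values * order_values * limit_values * pages)  # 18000 URL variants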
e.g., robots.txt contains:
Disallow: /products/*?dir*
Disallow: /products/*?dir=asc
but I still see entries in the debug log like:
((include https://equinepodiatry.com.au/products/horseshoes-aluminium.html?dir=asc&order=price))
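For what it's worth, under Google-style wildcard matching that URL should be blocked by the first rule. Here is a minimal Python sketch of that check (the rule-to-regex conversion is my own illustration, not the generator's code):

import re
from urllib.parse import urlsplit

def is_disallowed(url, disallow_rules):
    # Compare path + query against each Disallow pattern,
    # treating '*' as "any sequence of characters" (Google-style wildcard).
    parts = urlsplit(url)
    target = parts.path + ('?' + parts.query if parts.query else '')
    for rule in disallow_rules:
        pattern = '^' + re.escape(rule).replace(r'\*', '.*')
        if re.match(pattern, target):
            return True
    return False

rules = ['/products/*?dir*', '/products/*?dir=asc']
url = 'https://equinepodiatry.com.au/products/horseshoes-aluminium.html?dir=asc&order=price'
print(is_disallowed(url, rules))  # True -- a wildcard-aware crawler should skip it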
I have the robots.txt option turned on (full config attached):
<option name="xs_robotstxt">1</option>
The sitemap generator also just chokes on this and I have to use "Continue the interrupted session" multiple times:
Updated on 2014-07-20 01:48:42, Time elapsed: 2:02:04,
Pages crawled: 7911 (4481 added in sitemap), Queued: 16, Depth level: 8
Current page: https://domain.com/products/horseshoes-glue-on/sigafoos.html?dir=asc&limit=9&order=name&p=11 (1.1)