I am stuck in the same bind that many people trying to use this script are in. My site is hosted on a shared server at a hosting company. We are not able to increase the amount of memory available to the process. Based on the other threads I've seen in these forums, it appears 32k is a standard amount we all get allocated.
I'm looking to do two things: (1) figure out a way I can index my site given the constraints I have and (2) offer suggestions for improving the script so it will work under lower memory constraints.
(1) I bought this script because I want to index more than just 100 pages of my site. Given the memory constraints, that's about all I get. If I go another level deeper, I run out of memory. If all I get is 100 pages, this script isn't much use to me. Is it possible to run this script on one of my personal machines, which I control and can give as much memory as I want, and have it index a web site that lives on a different machine (the site on the hosted server)? I assume I can. However, the resulting files will be saved on my local machine, correct? So, does this mean I will have to write a batch job that calls the sitemap script, then FTPs the files when it's done, and then pings Google?
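Something along these lines is what I have in mind, just as a rough sketch (in Python only for illustration; the generator command, file names, host, credentials, and the exact Google ping URL are all placeholders I'd have to fill in from the real setup and Google's documentation):

```python
import subprocess
import ftplib
import urllib.parse
import urllib.request

# Placeholder values -- adjust for the real generator, paths, host, and login.
GENERATOR_CMD = ["python", "sitemap_gen.py", "--config=config.xml"]  # hypothetical invocation
LOCAL_FILES = ["sitemap.xml"]             # files the generator writes locally
FTP_HOST = "ftp.example.com"
FTP_USER = "username"
FTP_PASS = "password"
REMOTE_DIR = "/public_html"
SITEMAP_URL = "http://www.example.com/sitemap.xml"

# 1. Run the sitemap generator on the local machine, where memory is not capped.
subprocess.run(GENERATOR_CMD, check=True)

# 2. FTP the resulting files up to the shared host.
with ftplib.FTP(FTP_HOST, FTP_USER, FTP_PASS) as ftp:
    ftp.cwd(REMOTE_DIR)
    for name in LOCAL_FILES:
        with open(name, "rb") as fh:
            ftp.storbinary("STOR " + name, fh)

# 3. Ping Google so it knows the sitemap was updated
#    (check Google's docs for the current ping URL).
ping = "http://www.google.com/ping?sitemap=" + urllib.parse.quote(SITEMAP_URL, safe="")
urllib.request.urlopen(ping)
```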
(2) I haven't looked at the code of the script yet, but I will assume you are not already doing these things.
(2a) Have you considered some sort of compression algorithm to make the URL list take up less space in memory?
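Just to make the idea concrete, here's one rough way it could look (Python purely for illustration, class name and block size made up): keep the crawled-URL list as zlib-compressed blocks instead of plain strings, trading some CPU time for memory.

```python
import zlib

class CompressedURLList:
    """Holds the list of seen URLs as zlib-compressed blocks,
    so only a small uncompressed buffer sits in memory at once."""

    BLOCK_SIZE = 500  # URLs per compressed block; made-up tuning value

    def __init__(self):
        self._blocks = []    # compressed byte blobs
        self._pending = []   # uncompressed URLs not yet packed into a block

    def add(self, url):
        """Add the URL if it is new; return True if it was added."""
        if self.contains(url):
            return False
        self._pending.append(url)
        if len(self._pending) >= self.BLOCK_SIZE:
            blob = zlib.compress("\n".join(self._pending).encode("utf-8"))
            self._blocks.append(blob)
            self._pending = []
        return True

    def contains(self, url):
        if url in self._pending:
            return True
        # Decompress one block at a time so only one block is expanded in memory.
        for blob in self._blocks:
            if url in zlib.decompress(blob).decode("utf-8").split("\n"):
                return True
        return False
```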
(2b) Have you considered adding a config value for the maximum memory the script should use, and then, if the process needs more than that, having it start reading/writing the URLs to disk? It would take longer, but at least it would work. Rather than one huge, growing file of URLs, which would take far too long to scan for each URL check, look at using a bunch of smaller files. Take the URL, add up the ASCII values of each character in it, then MOD 1000. Create/open the file whose name includes that result, say tmp743.txt, and read the URLs from that file one at a time to see if any match. If no match is found, add this URL to the end of the file. If you combine this with 2a, it will go even faster.
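A rough sketch of that bucketing idea, again only illustrative (directory name, file naming, and function names are invented): the character-code sum mod 1000 picks one small bucket file, so each membership check only scans a thousandth of the URLs and memory stays flat.

```python
import os

BUCKET_DIR = "url_buckets"   # hypothetical scratch directory for the bucket files
NUM_BUCKETS = 1000

def bucket_path(url):
    # Sum of the character codes, mod 1000, selects the bucket file, e.g. tmp743.txt.
    bucket = sum(ord(c) for c in url) % NUM_BUCKETS
    return os.path.join(BUCKET_DIR, "tmp%d.txt" % bucket)

def add_if_new(url):
    """Return True and append the URL if it has not been seen before,
    reading the bucket file one line at a time to keep memory use low."""
    os.makedirs(BUCKET_DIR, exist_ok=True)
    path = bucket_path(url)
    if os.path.exists(path):
        with open(path, "r") as fh:
            for line in fh:
                if line.rstrip("\n") == url:
                    return False   # already recorded
    with open(path, "a") as fh:
        fh.write(url + "\n")
    return True
```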