Re: Only spiders a small amount of my pages
« Reply #15 on: August 10, 2005, 12:10:22 AM »
Hi,

I don't know if this will help but in /var/log/messages I find a load of these. An entry per crawl or page found:

preg_match(): Unknown modifier '|' in /home/httpd/htdocs/sitemap/pages/class.grab.inc.php(2)

BTW, the only config setting I can think of that would cause this sort of problem for your script is safe mode being on. That said, it should only mean that your script cannot set a max execution time.
Re: Only spiders a small amount of my pages
« Reply #16 on: August 10, 2005, 02:49:13 AM »
Hi,

The script was tested in a "safe_mode on" environment and works fine apart from the max execution time setting.
That preg_match() error looks strange (it has never been reported before) and may well be the reason for the crawling problem. Did you modify the config.inc.php file manually or set any specific config values? (I can't access the generator instance that you sent me anymore.)
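
For anyone reading along, here is a rough sketch of how an "Unknown modifier '|'" warning can arise. It is a simplified Python model of how PHP's preg_* functions split a pattern string into delimiter, body and modifiers. It is not the generator's actual code: the exclude values, the helper names and the modifier list are all assumptions made up for illustration.

```python
# Simplified model of PHP's preg_* pattern splitting (illustration only).
# Assumption: the pattern uses a plain delimiter such as '/', not brackets.

# Assumed subset of PCRE modifier letters accepted by PHP.
VALID_MODIFIERS = set("imsxADSUXJu")


def split_php_pattern(pattern):
    """Return (body, modifiers) the way a '/body/mods' pattern is read."""
    pattern = pattern.lstrip()
    delim = pattern[0]
    i = 1
    while i < len(pattern):
        if pattern[i] == "\\":
            i += 2  # a backslash escapes the next character, skip both
            continue
        if pattern[i] == delim:
            # First unescaped delimiter ends the body; the rest are modifiers.
            return pattern[1:i], pattern[i + 1:]
        i += 1
    raise ValueError("no ending delimiter found")


def first_unknown_modifier(pattern):
    """Return the first character after the closing delimiter that is not
    a known modifier, or None if the pattern splits cleanly."""
    _, modifiers = split_php_pattern(pattern)
    for ch in modifiers:
        if ch not in VALID_MODIFIERS:
            return ch
    return None


def escape_delimiter(text, delim="/"):
    """Escape only the delimiter (PHP's preg_quote() with its second
    argument does this and also escapes regex metacharacters)."""
    return text.replace(delim, "\\" + delim)


# Hypothetical config values joined into one alternation pattern.
excludes = ["images/", "print"]

bad = "/" + "|".join(excludes) + "/"
good = "/" + "|".join(escape_delimiter(e) for e in excludes) + "/"

print(bad, "->", first_unknown_modifier(bad))
print(good, "->", first_unknown_modifier(good))
```

In this model the unescaped "/" inside a config value closes the pattern early, so the "|" that follows is read as a modifier. If the real script builds its patterns from config values in a similar way, a value containing the delimiter character would be one thing to check.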
Re: Only spiders a small amount of my pages
« Reply #17 on: August 10, 2005, 12:18:17 PM »
Hello

I have just tested this on another one of my servers, spidering the same website, and it fails again. This server has Apache 1.x and PHP 4.4.0 with a bog-standard, unoptimised PHP config. I also made sure eaccelerator was not installed on there.

I will PM you the updated url to the sitemap install. I want to know what is different when you try this. I am running my crawls from the script in a screen'd shell.
Re: Only spiders a small amount of my pages
« Reply #18 on: August 10, 2005, 05:43:31 PM »
Can you tell me what needs to be compiled into PHP for this to work?

I am grabbing at straws now ...
Re: Only spiders a small amount of my pages
« Reply #19 on: August 10, 2005, 09:44:25 PM »
Hi,

The generator script doesn't require any external PHP modules to be compiled in. It uses standard PHP network functions for crawling.
I will try to recreate the same environment (Apache 1.x and PHP 4.4.0) to test it.
Re: Only spiders a small amount of my pages
« Reply #20 on: August 10, 2005, 10:45:37 PM »
Thanks for your help.

This is driving me nuts :-)

Have tried it now on 2 servers. The only thing I have left to try is to install LAMP on one of my Windows PCs here and see if I can crawl it from here. Even if I did manage to do it, it wouldn't get me very far other than giving me the sitemap files to upload to my server.
Re: Only spiders a small amount of my pages
« Reply #21 on: August 11, 2005, 12:12:05 AM »
Hi,

Please try the Windows LAMP setup if you have a chance; that will help to narrow down the problem :)
Re: Only spiders a small amount of my pages
« Reply #22 on: August 11, 2005, 02:54:47 PM »
Takes a long time to run on a home broadband connection ...

Have just recompiled PHP and Apache on one of my servers (Apache 1.3.33, PHP 4.4.0) and basically made it the most bloated PHP install possible by adding everything (it was as close to a Microsoft version of PHP as possible), and still your script wouldn't crawl properly.

Things I have tried:

Disabling mod_security
Disabling eaccelerator
Disabling mod_deflate
Recompiling PHP
Recompiling Apache
Resetting all permissions on Generator scripts
Redownloading from your site and reuploading the script to my server
Re: Only spiders a small amount of my pages
« Reply #23 on: August 11, 2005, 02:57:10 PM »
I get a huge amount of these in my error_logs when the script is running:

[Thu Aug 11 14:56:29 2005] [error] [client xxx.xxx.xxx.xxx.xxx] request failed: error reading the headers, referer: