Cronjob problem
Hello,
I have a web site I want to index with this great tool: www.john-howe.com. The web interface works with a depth of 5000, except that it stops after 30 minutes (once after 1 hour 23 minutes). I read some posts in this forum and tried to run the spider from a cron job instead. I tried the crontab on the host where john-howe.com lives, but it won't work... ;o( I also tried from my personal web site, www.metadelic.com, in cPanel with this syntax:
Code:
php -f http://www.john-howe.com/admin/spider.php http://www.john-howe.com
Code:
No input file specified
Code:
wget http://www.john-howe.com/admin/spider.php http://www.john-howe.com
Code:
Subject: Cron <metadeco@server857> wget http://www.john-howe.com/admin/spider.php http://www.john-howe.com
So how can I index the whole site? Any suggestions? Thanks a lot for your help and time, Dominique Javet |
And I forgot... it saves nothing into the DB!
Regards, DOM |
Stop indexing in web interface
Hello,
I've installed the latest version of PhpDig and all is OK. I can index part of my web site (where PhpDig is installed), but when I try to index the whole site, after a certain time (randomly?) it stops indexing and keeps the site locked. I have safe_mode off and I don't think it's the timeout. I also have a few problems with the cron job (it doesn't work), but I'd like to do the first index from the web interface and from the root (my site has about 1500 pages, static + dynamic). -> http://www.phpdig.net/forum/showthread.php?t=1706 Have you experienced a problem like this? What can I do? Should I index my site part by part, or can I tell it to index the whole site and let spider.php run all night via the web interface? How do you proceed? Regards, Dom |
Is the following URL where your spider.php file is?
Code:
http://www.john-howe.com/admin/spider.php
Code:
/[PHPDIG-directory]/admin/spider.php
Code:
/home/username/public_html/[PHPDIG-directory]/admin/spider.php
Code:
php -f /home/username/public_html/[PHPDIG-directory]/admin/spider.php http://www.john-howe.com/
|
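The "No input file specified" message earlier in the thread is consistent with `php -f` being given a URL: the switch expects a file on the local filesystem. Likewise, wget treats each argument as a separate URL to download, so the second URL is not passed to spider.php as a parameter. A minimal sketch of a wrapper that checks the argument before running the spider; `run_spider` and the `PHP_BIN` variable are hypothetical names used for illustration, not part of PhpDig.

```shell
#!/bin/sh
# run_spider SPIDER_PATH TARGET
# Hypothetical helper (not part of PhpDig): refuses a URL where a
# filesystem path is needed, then runs the spider with php -f.
# PHP_BIN is an assumed override for the PHP binary; it defaults to "php".
run_spider() {
    spider="$1"
    target="$2"

    # php -f reads a local file, so an http(s) URL cannot work here.
    case "$spider" in
        http://*|https://*)
            echo "error: php -f needs a filesystem path, not a URL: $spider" >&2
            return 2
            ;;
    esac

    # Fail early with a clear message if the path is wrong.
    if [ ! -f "$spider" ]; then
        echo "error: no such file: $spider" >&2
        return 1
    fi

    "${PHP_BIN:-php}" -f "$spider" "$target"
}
```

With the placeholders from the post above filled in for a real account, the call would take the form `run_spider /home/username/public_html/[PHPDIG-directory]/admin/spider.php http://www.john-howe.com/`.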
Thanks for your reply.
I tried this one from an external server:
Code:
wget http://www.john-howe.com/search/admin/spider.php http://www.john-howe.com
Code:
Subject: Cron <metadeco@server857> wget http://www.john-howe.com/search/admin/spider.php http://www.john-howe.com
My unprotected admin: http://www.john-howe.com/search/admin/ Why does it stop after a short time when I index from the root? :bang: I also tried this internally via my admin web panel, with no result :angry: :
Code:
/usr/bin/php -f /home/www/web330/html/search/admin/temp/spider.php forceall http://www.john-howe.com/search/admin/cronfile.txt >> spider.log
I appreciate your help and time. Regards, Dominique |
As for the spider stopping prematurely: packets get lost, connections drop, browsers or servers time out, hosts may kill the process; take your pick. As for setting up a cron job or running PhpDig from the shell, see section 7 of the updated documentation.
|
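Because any of these causes can end a run without a visible error, it may help to record each run's output and exit status. The sketch below is a generic shell pattern, not something PhpDig provides; `run_logged` is a hypothetical helper name. Unlike a plain `>> spider.log`, it also captures stderr.

```shell
#!/bin/sh
# run_logged LOGFILE COMMAND [ARGS...]
# Hypothetical helper: appends a start line, the command's stdout and
# stderr, and its exit status to LOGFILE, then returns that status.
run_logged() {
    log="$1"
    shift

    printf '%s start: %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$log"

    status=0
    # Capture both output streams so error messages end up in the log.
    "$@" >> "$log" 2>&1 || status=$?

    printf '%s exit status: %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$status" >> "$log"
    return "$status"
}
```

A cron job could then call `run_logged` with a log file path followed by the usual `php -f` command line. A missing "exit status" line after a "start" line would suggest the process was killed from outside rather than ending on its own.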
Thanks a lot! It seems to work :banana:
It has been indexing for 2 hours via cron and is still going. But I must point out that without replacing the relative paths to the files listed in config.php, it does not work (for me); I had to replace them in all the script pages, and only then did it work... I run the cron job on the same host as my site. From another, physically remote web cron server, it does not work. I must find out why... Anyway, now it's working and indexing my site. Thanks a lot for your explanation. It's clear now with your updated documentation. All my best regards, Dominique PS: Great software and great work, thank you! Greetings from Switzerland. |
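Cron typically starts a job in the user's home directory rather than in the script's own directory, which is one common reason relative paths resolve in the web interface but not under cron. Whether that is the cause here is an assumption. A generic workaround, sketched below, is to change into the script's directory before running it; `run_in_script_dir` and `PHP_BIN` are hypothetical names, not part of PhpDig.

```shell
#!/bin/sh
# run_in_script_dir SCRIPT_PATH [ARGS...]
# Hypothetical helper: runs a PHP script from inside its own directory
# so that relative paths in the script resolve as they would there.
# PHP_BIN is an assumed override for the PHP binary; it defaults to "php".
run_in_script_dir() {
    script="$1"
    shift

    dir=$(dirname "$script")
    file=$(basename "$script")

    # The subshell keeps the caller's working directory unchanged.
    ( cd "$dir" && "${PHP_BIN:-php}" -f "$file" "$@" )
}
```

For example, `run_in_script_dir /home/www/web330/html/search/admin/spider.php forceall` would run the spider with `/home/www/web330/html/search/admin` as the working directory; the path is adapted from the one quoted earlier in the thread.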
Glad it's working for you, but you don't have to change all the files, just set ABSOLUTE_SCRIPT_PATH in the config file.
PHP Code:
|
I did that, but when I then go to the web interface, I get a blank white screen... and after replacing the relative paths with absolute ones, everything works well again.
Hummm... Dom |
What version of PhpDig are you using?
|
Hello,
The latest one, 1.8.6, on Linux. I also noticed that my cron job works only with forceall! When I use all or my domain to update, it doesn't work... Regards, Dom |
That doesn't make sense. Read the documentation in toto and see if it doesn't help.
|
I understand, and I read the documentation very carefully, but that's the truth :o
Maybe it depends on the ISP, I don't know... I also tried the wget command from an external site, and that doesn't work either. Why does it work for some people and not for me? I don't know. I'm still trying and testing. Dom |
What is your LIMIT_DAYS set to in the config file?
|
define('SEARCH_DEFAULT_LIMIT',10); //results per page
define('SPIDER_MAX_LIMIT',2000); //max recurse levels in spider
define('RESPIDER_LIMIT',5); //recurse respider limit for update
define('LINKS_MAX_LIMIT',20); //max links per each level
define('RELINKS_LIMIT',5); //recurse links limit for an update
//for limit to directory, URL format must either have file at end or ending slash at end
//e.g., http://www.domain.com/dirs/ (WITH ending slash) or http://www.domain.com/dirs/dirs/index.php
define('LIMIT_TO_DIRECTORY',false); //limit index to given (sub)directory, no sub dirs of dirs are indexed
define('LIMIT_DAYS',0); //default days before reindex a page
define('SMALL_WORDS_SIZE',2); //words to not index - must be 2 or more
define('MAX_WORDS_SIZE',30); //max word size
Dom |
PHP Code:
|
I will try this tonight.
After a few days of testing and reading the documentation, the concepts behind these settings (recurse links, etc.) are still not so clear to me. Why use 2000 instead of 20, why use sleep(5) when sleep(2) works really fine, etc.? Must I keep the txt files under text_content once indexing is done? One thing I noticed is that my ISP updates the cron jobs every 30 minutes, while on my other ISP it's every minute... :bang: I would have saved a lot of time and frustration had I known this! And of course, it isn't documented in the ISP's online help... So, I'll continue with my tests and let you know. Is there a tutorial about fine-tuning the settings we discussed? I found a cron tutorial, but nothing else. BTW, it's really impressive. I have a site with 3400 indexed pages and 696'500 keywords. Excellent. Regards, Dom |