12-23-2003, 07:16 AM | #1 |
Head Mole
Join Date: May 2003
Posts: 2,539
Cron Job on Linux/Apache
Hi. Say you want to run a cron job that spiders on the 1st and 15th of every month.
First make a list of full URLs (e.g., http://www.domain.com) to be crawled, one per line, in a file called cronlist.txt (add or remove URLs in the cronlist.txt file when the spider is not indexing). Then create a file called cronfile.txt that contains the following on one line, editing the full paths as needed: Code:
0 0 1,15 * * /full/path/to/php -f /full/path/to/admin/spider.php /full/path/to/cronlist.txt >> /full/path/to/spider.log
Next, install the cron job from shell (add -u username to install it for a different user): Code:
/full/path/to/crontab /full/path/to/cronfile.txt
You may also replace "/full/path/to/cronlist.txt" (without quotes) in the cronfile.txt file with "http://www.domain.com" or "all" or "forceall" (without quotes) for different indexing options. If you have CRON_ENABLE set to true in the config file, you may use the cronfile.txt created by PhpDig in place of a manually created cronfile.txt file. To see that your cron job is set, type /full/path/to/crontab -l from shell. If you want to delete the cron job, type /full/path/to/crontab -r from shell. A general cron tutorial can be found at http://www.linuxhelp.net/guides/cron/
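The steps above can be sketched as a short shell session. This is a minimal sketch, assuming a hypothetical /tmp/phpdig-demo install directory and /usr/bin/php; substitute your real paths before using it.

```shell
# Hypothetical demo layout; substitute the real locations of php and PhpDig.
PHPDIG=${PHPDIG:-/tmp/phpdig-demo}
mkdir -p "$PHPDIG"

# 1. One full URL per line to be crawled.
printf 'http://www.domain.com\n' > "$PHPDIG/cronlist.txt"

# 2. The cronfile: run at 00:00 on the 1st and 15th, append output to a log.
printf '0 0 1,15 * * /usr/bin/php -f %s/admin/spider.php %s/cronlist.txt >> %s/spider.log\n' \
    "$PHPDIG" "$PHPDIG" "$PHPDIG" > "$PHPDIG/cronfile.txt"

# 3. Install and verify (run these by hand on the server; commented out here):
#    crontab "$PHPDIG/cronfile.txt"
#    crontab -l
cat "$PHPDIG/cronfile.txt"
```

The schedule field `0 0 1,15 * *` reads minute 0, hour 0, on days 1 and 15 of every month.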
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
01-11-2004, 03:49 PM | #2 |
Green Mole
Join Date: Dec 2003
Posts: 11
Thanks for the tutorial...
Questions: Will this update or simply fresh-spider every site in the text file list? In connection, will this method use stored usernames and passwords for password-protected sites? What sort of server load (on average) will running spiders on a whole list of sites create? Lastly, wouldn't it be simpler to set up a cron job to run a "spiderupdate.php" or equivalent? spiderupdate.php could pull all the URLs out of the database and spider them according to the config settings. That beats manually entering several hundred URLs (although one could probably export a text file with the URLs from the database table as well). Thanks, -Paul
01-13-2004, 07:16 PM | #3 |
Head Mole
Join Date: May 2003
Posts: 2,539
Hi. Options available via command line indexing are as follows:
#php -f [PHPDIG_DIR]/admin/spider.php [option]

List of options:
- all (default) : update all hosts;
- forceall : force update of all hosts;
- http://mondomaine.tld : add or update the URL;
- path/file : add or update all URLs listed in the given file.

Some examples are given here, and cronlist.txt could be replaced with any of the options. Option all updates sites according to the time limit as set via the config file or meta tag. Option forceall forces the update of sites regardless of the time limit. Using a single URL will index or update a site according to the time limit. Using a file will index or update the sites in the file, as well as other sites already in the database, according to the time limit. If site information is already stored in the database tables, that information should be used in an update. Because of the options available, a "spiderupdate.php" isn't necessary. As for server load, that depends on the particular server. The best thing to do would be to set up some test sites, try the different options, and then run uptime or top via shell to check server load on your particular machine.

EDIT: As of PhpDig 1.8.0, using a file will index or update only the sites in the file, assuming the tempspider table is empty between runs.
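For reference, the four forms above can be wrapped in a small shell helper. The run_spider function and its DRY_RUN switch are illustrative additions, not part of PhpDig, and the /var/www/phpdig path is an assumption; with DRY_RUN=1 the script only echoes the command it would run.

```shell
# Hypothetical wrapper; the SPIDER path is an assumption, adjust to your install.
SPIDER=${SPIDER:-/var/www/phpdig/admin/spider.php}
DRY_RUN=${DRY_RUN:-1}   # set to 0 to actually invoke the spider

run_spider() {
    # $1 is one of: all | forceall | a URL | a file of URLs
    if [ "$DRY_RUN" = 1 ]; then
        echo "php -f $SPIDER $1"
    else
        php -f "$SPIDER" "$1"
    fi
}

run_spider all                    # update all hosts past their time limit
run_spider forceall               # update all hosts, ignoring the time limit
run_spider http://mondomaine.tld  # add or update one site
run_spider cronlist.txt           # add or update every URL listed in the file
```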
04-01-2004, 11:46 PM | #4 |
Orange Mole
Join Date: Mar 2004
Posts: 48
Here is a page that will generate crontab entries for you.
http://www.clockwatchers.com/cgi-clockwatchers/crontool
04-02-2004, 12:48 AM | #5 |
Orange Mole
Join Date: Mar 2004
Posts: 48
Hi Charter,
This still isn't clear to me. If you set a cron job to index the URLs in a file list, then once the spider has indexed the list, what method will the spider use when the cron job runs again? Will it index all pages found even if the update date has not been reached, or will it skip files that have been recently indexed and are not yet due to be indexed?
04-10-2004, 04:49 PM | #6 |
Head Mole
Join Date: May 2003
Posts: 2,539
Hi. Using a file will index or update the sites in the file according to the time limit. Only the forceall option ignores this update date.
06-03-2004, 02:00 AM | #7 |
Green Mole
Join Date: May 2004
Posts: 2
autoindexing, RaQ 550, Mailman
I have a RaQ 550 server running Linux, dedicated primarily to the installed Mailman listserver app. I had a difficult time trying to autoindex using all the various approaches in various threads here, but I figured it out and it's now working!
Here's how I did it by simply adding an entry to /etc/crontab.

From Charter: "Hi. If you wish to call spider.php from a directory other than the admin directory, you need to edit the first if statement in the config file so that it allows for the different path, that path being a relative and/or full path UP TO but NOT including the admin directory - no ending slash."

I added in config.php: Code:
if ((isset($relative_script_path)) && ($relative_script_path != ".") && ($relative_script_path != "..") && ($relative_script_path != "/home/.sites/28/site1/.users/91/lists/web/search")) {
    exit();
}

EDIT: As of PhpDig 1.8.1, use the following in the config file instead. Code:
define('ABSOLUTE_SCRIPT_PATH','/home/.sites/28/site1/.users/91/lists/web/search'); // full path up to but not including admin dir, no end slash
if ((!isset($relative_script_path)) || (($relative_script_path != ".") && ($relative_script_path != "..") && ($relative_script_path != ABSOLUTE_SCRIPT_PATH))) {
    // echo "\n\nPath $relative_script_path not recognized!\n\n";
    exit();
}

Then I added to /etc/crontab: Code:
# phpdig autoindex
02 1,13 * * * root php -f /home/.sites/28/site1/.users/91/lists/web/search/admin/spider.php /home/.sites/28/site1/.users/91/lists/web/search/admin/cronlist.txt >> /home/.sites/28/site1/.users/91/lists/web/search/admin/spider.log

Note: the crontab entry is all one line under the # phpdig autoindex comment. I used full paths for spider.php and spider.log (didn't need the full path for php -f), and cronlist.txt contains the URL to index.

Now phpdig automatically reindexes the site (Mailman listserver) at 1:02am and 1:02pm. Since phpdig is an external search app, I simply found the right html template in Mailman and added html coding to display the phpdig graphic and link it to the search page on the archive's table of contents page. Hope this info helps out.
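One detail worth flagging in the entry above: lines in /etc/crontab carry an extra user field (here, root) between the five schedule fields and the command, unlike a per-user crontab installed with the crontab command. A minimal sketch, with a hypothetical short /var/www/phpdig path standing in for the long RaQ paths:

```shell
# Hypothetical short path standing in for the long /home/.sites/... install.
PHPDIG=/var/www/phpdig

# Per-user crontab line: five schedule fields, then the command.
user_line="02 1,13 * * * php -f $PHPDIG/admin/spider.php $PHPDIG/admin/cronlist.txt >> $PHPDIG/admin/spider.log"

# /etc/crontab line: same schedule, but with a user field (root) before the command.
system_line="02 1,13 * * * root php -f $PHPDIG/admin/spider.php $PHPDIG/admin/cronlist.txt >> $PHPDIG/admin/spider.log"

# The sixth whitespace-separated field shows the difference: command vs. user.
echo "$user_line"   | awk '{print $6}'   # php
echo "$system_line" | awk '{print $6}'   # root
```

Pasting a per-user-style line into /etc/crontab (or vice versa) is a common reason an otherwise correct entry never fires.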