|
08-25-2004, 06:17 AM | #1 |
Orange Mole
Join Date: Nov 2003
Posts: 69
|
API function indexpage(URL, words)
I've been away from PhpDig and from this forum for a while. Today I've started again, and I installed a fresh 1.8.3 from scratch.
The site that I have to index is on my own PC, together with PhpDig. It's a dynamic site, with a PHP script that I wrote myself. The script works with a database that defines the links and the indexable words that have to appear in the generated HTML pages.
So here's my question, a "How to" question or, if this is not currently possible, a mod request: can I bypass the spider? Is there an API function that I can call from my script to tell PhpDig: for URL such-and-so, please put this list of words in your tables?
And a question on the side: I have a new PC with lots of memory (1 GB), and yet the spider takes 30 minutes to index 90 relatively short pages, even after I commented out the "sleep(5)". Are there other admins in similar situations with comparable experiences? (After spidering there is 1 site with 90 pages, 2050 keywords and 8880 references in engine, so really peanuts! With the browser, I can visit all the pages in about 1 minute...)
__________________
René Haentjens, Ghent University |
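To make the request concrete, here is a minimal sketch of what an indexpage(URL, words) call could do: write a page and its word list straight into an inverted index without fetching anything. It is written in Python with an in-memory SQLite database purely for illustration. The table names, columns and the `index_page` function are hypothetical and are not PhpDig's real schema or API.

```python
import sqlite3
from collections import Counter


def create_tables(conn):
    # Simplified, hypothetical schema; NOT PhpDig's real table layout.
    conn.executescript("""
        CREATE TABLE pages (page_id INTEGER PRIMARY KEY, url TEXT UNIQUE);
        CREATE TABLE keywords (key_id INTEGER PRIMARY KEY, keyword TEXT UNIQUE);
        CREATE TABLE refs (page_id INTEGER, key_id INTEGER, weight INTEGER,
                           PRIMARY KEY (page_id, key_id));
    """)


def index_page(conn, url, words):
    """Store (or replace) the word list for one URL without spidering it.

    Returns the number of distinct keywords stored for the page.
    """
    # Normalise case and count repeats so each keyword gets a weight.
    counts = Counter(w.strip().lower() for w in words if w.strip())
    with conn:  # one transaction per page
        conn.execute("INSERT OR IGNORE INTO pages (url) VALUES (?)", (url,))
        page_id = conn.execute(
            "SELECT page_id FROM pages WHERE url = ?", (url,)).fetchone()[0]
        # Re-indexing a page replaces its previous references.
        conn.execute("DELETE FROM refs WHERE page_id = ?", (page_id,))
        for word, weight in counts.items():
            conn.execute(
                "INSERT OR IGNORE INTO keywords (keyword) VALUES (?)", (word,))
            key_id = conn.execute(
                "SELECT key_id FROM keywords WHERE keyword = ?",
                (word,)).fetchone()[0]
            conn.execute(
                "INSERT INTO refs (page_id, key_id, weight) VALUES (?, ?, ?)",
                (page_id, key_id, weight))
    return len(counts)
```

The script that generates the pages would call `index_page` once per URL with the words it already knows from its own database, so no HTTP requests are involved at all.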
08-25-2004, 09:44 AM | #2 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
For 1: http://www.phpdig.net/forum/showthread.php?t=454
For 2: my guess is that some servers may not like the way I dealt with chunk encoding
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
08-26-2004, 02:47 AM | #3 |
Orange Mole
Join Date: Nov 2003
Posts: 69
|
Thanks, Head Mole!
1. Topic 454 is not quite about an API call; to me it looks like it makes spidering even slightly more complex. I am looking for a more direct channel for feeding the database with URL+keywords...
2. Can you be more specific? What is chunk encoding, and where and how do you deal with it? I found the term in a couple of earlier notes and in the 1.8.3 CHANGELOG, but without further explanation.
__________________
René Haentjens, Ghent University |
08-26-2004, 08:33 AM | #4 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
1. It doesn't exist like you want so it's a mod request.
2. Chunk encoding (HTTP/1.1 chunked transfer encoding) is when content is sent as byte count, content, byte count, content, etcetera.
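As an illustration of the "byte count, content, byte count, content" layout, here is a minimal Python sketch of decoding a chunked HTTP/1.1 body. It is not PhpDig's code; the function name `decode_chunked` is made up for this example, and trailers after the last chunk are simply ignored.

```python
def decode_chunked(body: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked message body into the plain payload."""
    payload = bytearray()
    pos = 0
    while True:
        # Each chunk starts with a size line: hex length, optional ";ext".
        line_end = body.index(b"\r\n", pos)
        size_field = body[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        pos = line_end + 2
        if size == 0:
            break  # zero-length chunk marks the end; trailers are ignored
        payload += body[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(payload)


if __name__ == "__main__":
    wire = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"
    print(decode_chunked(wire))
```

A client that treats the size lines as content, or waits for more data after the zero-length chunk, will produce garbage or stall, which is the kind of problem hinted at in the earlier reply.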
09-02-2004, 01:19 AM | #5 |
Orange Mole
Join Date: Nov 2003
Posts: 69
|
I've started to implement the requested API myself. If I can make it work with a reasonable amount of effort, I'll report back here.
__________________
René Haentjens, Ghent University |
09-08-2004, 04:52 AM | #6 |
Orange Mole
Join Date: Nov 2003
Posts: 69
|
__________________
René Haentjens, Ghent University |
09-08-2004, 11:58 PM | #7 |
Orange Mole
Join Date: Nov 2003
Posts: 69
|
The reason why it took the spider 30 minutes to index 90 relatively short pages is probably not chunk encoding. I have now discovered that Apache wrote about 10 MB of access log and 9 MB of error log during that half hour. The spider was making zillions of requests for "funny" kinds of URLs the whole time (see below).
Any idea what happened, Charter? Code:
157.193.197.26 - - [25/Aug/2004:11:31:51 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/index.php?sid=pptsl092&thumb=pptsl092_t.jpg HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:31:51 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/index.php?sid=pptsl093&thumb=pptsl093_t.jpg HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:31:51 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl001.jpg">/ HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:11:31:51 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl001.jpg"> HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:11:31:51 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl001_t.jpg"%20/>/ HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:11:31:51 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl001_t.jpg"%20/> HTTP/1.1" 404 0
...
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl093.jpg"%20/>/ HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl093.jpg"%20/> HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/index.php?sid=pptsl008&thumb=pptsl008_t.jpg HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "GET /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/index.php?sid=pptsl008&thumb=pptsl008_t.jpg HTTP/1.1" 200 8087
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/claroline/css/default.css HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/index.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/claroline/auth/courses.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/claroline/auth/profile.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/index.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:47 +0200] "HEAD /dokeos/VELODLA/index.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:47 +0200] "HEAD /dokeos/index.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:47 +0200] "HEAD /dokeos/VELODLA/index.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:47 +0200] "HEAD /dokeos/VELODLA/index.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:47 +0200] "HEAD /dokeos/claroline/document/document.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:47 +0200] "HEAD /dokeos/claroline/scorm/scormdocument.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:48 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl001.jpg">/ HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:11:33:48 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl001.jpg"> HTTP/1.1" 404 0
...
157.193.197.26 - - [25/Aug/2004:12:00:43 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl093.jpg"%20/>/ HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:12:00:43 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl093.jpg"%20/> HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:12:00:43 +0200] "POST /dokeos/VELODLA/183phpdig/admin/spider.php HTTP/1.1" 200 48484
157.193.197.26 - - [25/Aug/2004:12:01:58 +0200] "GET /dokeos/VELODLA/183phpdig/admin/index.php HTTP/1.1" 200 4427
Error log:
[Wed Aug 25 11:31:32 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/robots.txt
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl001.jpg">/
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl001.jpg">
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl001_t.jpg"/>/
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl001_t.jpg"/>
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl001.jpg"/>/
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl001.jpg"/>
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl002.jpg">/
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl002.jpg">
__________________
René Haentjens, Ghent University |
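With logs of this size, a quick tally shows how much of the spider's time goes into dead requests. The following Python sketch counts access-log entries per method and status and collects the request paths that contain stray markup characters. The regular expression and the `summarize` helper are written for this example only; the sample lines are copied from the excerpt above.

```python
import re
from collections import Counter

# Matches the timestamp, request line and status of an Apache access-log entry.
# The path is matched greedily because the odd paths contain double quotes.
LOG_LINE = re.compile(
    r'\[[^\]]+\] "(?P<method>[A-Z]+) (?P<path>.*) HTTP/\d\.\d" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def summarize(lines):
    """Count requests per (method, status) and collect paths with stray markup."""
    counts = Counter()
    odd_paths = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # not an access-log entry (e.g. the "..." elisions)
        counts[(match["method"], match["status"])] += 1
        # Quotes and angle brackets suggest HTML/JavaScript fragments, not URLs.
        if any(ch in match["path"] for ch in '"<>'):
            odd_paths[match["path"]] += 1
    return counts, odd_paths


SAMPLE = [
    '157.193.197.26 - - [25/Aug/2004:11:31:51 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl001.jpg">/ HTTP/1.1" 404 0',
    '157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "GET /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/index.php?sid=pptsl008&thumb=pptsl008_t.jpg HTTP/1.1" 200 8087',
    '157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/index.php HTTP/1.1" 200 0',
]

if __name__ == "__main__":
    counts, odd_paths = summarize(SAMPLE)
    for key, n in counts.most_common():
        print(key, n)
    for path, n in odd_paths.most_common():
        print(n, path)
```

Run over the full access log (for example by passing an open file object to `summarize`), a large share of HEAD requests ending in 404 on quoted paths would point at link extraction rather than at the transfer encoding.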
09-11-2004, 07:29 AM | #8 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
It could be JavaScript. At some earlier point, spaces and parentheses were allowed in links. See this post for where to remove such characters, but note that the backslashes in that post are no longer displayed correctly due to the vB upgrade.
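The idea of refusing links that contain such characters can be sketched in Python as follows. This is an illustration, not the pattern PhpDig uses: the character list in `BAD_CHARS` and the helper `plausible_link` are assumptions for this example. Parentheses are legal in URLs, so rejecting them is a trade-off that only makes sense when the site itself never uses them in links.

```python
import re

# Characters treated as a sign that the "link" is really a fragment of
# HTML or JavaScript. Illustrative list only, not PhpDig's own.
BAD_CHARS = re.compile(r'["<>\s(){}\\^`|]')


def plausible_link(href: str) -> bool:
    """Return True if href looks like a URL rather than leftover markup."""
    href = href.strip()
    return bool(href) and BAD_CHARS.search(href) is None


if __name__ == "__main__":
    candidates = [
        'index.php?sid=pptsl092&thumb=pptsl092_t.jpg',
        '"pptsl001.jpg">',
        '"pptsl001_t.jpg" />',
    ]
    for href in candidates:
        print(plausible_link(href), href)
```

Filtering candidates this way before issuing HEAD requests would avoid the 404 storms seen in the logs, at the cost of skipping any genuine link that contains one of the rejected characters.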