PhpDig.net

Old 08-25-2004, 06:17 AM   #1
renehaentjens
Orange Mole
 
Join Date: Nov 2003
Posts: 69
API function indexpage(URL, words)

I've been away from PhpDig and from this forum for a while. Today I've started again, and I installed a fresh 1.8.3 from scratch.

The site that I have to index is on my own PC, together with PhpDig. It's a dynamic site, driven by PHP scripts that I write myself. The scripts work with a database that defines the links and the indexable words that have to appear in the generated HTML pages.

So here's my question, a "How to" question or, if this is not currently possible, a mod request: can I bypass the spider? Is there an API function that I can call from my script to tell PhpDig: for URL such-and-so, please put this list of words in your tables?
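To make the request concrete, here is a minimal sketch of what such a call could look like. It is written in Python against an in-memory SQLite database purely for illustration; the table names (`pages`, `keywords`, `refs`), the column names, and the functions `open_index`, `index_page` and `search` are all hypothetical and do not reflect PhpDig's actual schema or code.

```python
import sqlite3

# Hypothetical three-table inverted index; NOT PhpDig's real schema.
SCHEMA = """
CREATE TABLE IF NOT EXISTS pages (
    page_id INTEGER PRIMARY KEY,
    url     TEXT UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS keywords (
    key_id INTEGER PRIMARY KEY,
    word   TEXT UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS refs (
    page_id INTEGER NOT NULL REFERENCES pages(page_id),
    key_id  INTEGER NOT NULL REFERENCES keywords(key_id),
    weight  INTEGER NOT NULL,
    PRIMARY KEY (page_id, key_id)
);
"""


def open_index(path=":memory:"):
    """Open (or create) the index database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def index_page(conn, url, words):
    """Store the given word list for one URL, replacing any earlier list."""
    # Count occurrences so that repeated words get a higher weight.
    counts = {}
    for word in words:
        word = word.strip().lower()
        if word:
            counts[word] = counts.get(word, 0) + 1

    with conn:  # one transaction per page
        conn.execute("INSERT OR IGNORE INTO pages (url) VALUES (?)", (url,))
        page_id = conn.execute(
            "SELECT page_id FROM pages WHERE url = ?", (url,)
        ).fetchone()[0]
        # Re-indexing a page first drops its old references.
        conn.execute("DELETE FROM refs WHERE page_id = ?", (page_id,))
        for word, weight in counts.items():
            conn.execute("INSERT OR IGNORE INTO keywords (word) VALUES (?)", (word,))
            key_id = conn.execute(
                "SELECT key_id FROM keywords WHERE word = ?", (word,)
            ).fetchone()[0]
            conn.execute(
                "INSERT INTO refs (page_id, key_id, weight) VALUES (?, ?, ?)",
                (page_id, key_id, weight),
            )
    return page_id


def search(conn, word):
    """Return the URLs that reference a word, heaviest first."""
    rows = conn.execute(
        "SELECT p.url FROM refs r "
        "JOIN pages p ON p.page_id = r.page_id "
        "JOIN keywords k ON k.key_id = r.key_id "
        "WHERE k.word = ? ORDER BY r.weight DESC, p.url",
        (word.lower(),),
    )
    return [row[0] for row in rows]
```

The point of the sketch is only that the site's own script already knows each URL and its words, so no HTTP request or HTML parsing is needed to fill the tables.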

And a question on the side: I have a new PC with lots of memory (1 GB) and yet it takes the spider 30 minutes to index 90 relatively short pages, even after I've commented out the "sleep(5)". Are there other admins in similar situations who have comparable experiences? (After spidering there is 1 site with 90 pages, 2050 keywords and 8880 references in engine, so really peanuts! With the browser, I can visit all the pages in about 1 minute...)
__________________
René Haentjens, Ghent University
Old 08-25-2004, 09:44 AM   #2
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
For 1: http://www.phpdig.net/forum/showthread.php?t=454

For 2: my guess is that some servers may not like the way I dealt with chunk encoding
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
Old 08-26-2004, 02:47 AM   #3
renehaentjens
Thanks, Head Mole!

1. Topic 454 is not quite about an API call; it looks to me like it makes spidering even slightly more complex. I am searching for a more direct channel for feeding the database with URL+keywords...

2. Can you be more specific? What is chunk encoding and where and how do you deal with it? I found the term in a couple of earlier notes and in the 1.8.3 CHANGELOG but without further explanations.
Old 08-26-2004, 08:33 AM   #4
Charter
1. It doesn't exist in the form you want, so it's a mod request.

2. Chunk encoding (HTTP chunked transfer encoding) is when content is sent in pieces, each preceded by its size in bytes: size, content, size, content, et cetera.
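As an illustration of point 2, here is a minimal Python sketch of decoding a chunked HTTP/1.1 body. In the wire format, each chunk starts with its size in hexadecimal followed by CRLF, the chunk data is followed by another CRLF, and a chunk of size 0 ends the body. The function name `decode_chunked` is made up for this example, the sketch ignores any trailer headers after the last chunk, and it is not the code PhpDig uses.

```python
def decode_chunked(body: bytes) -> bytes:
    """Reassemble a chunked HTTP/1.1 message body into plain content."""
    out = bytearray()
    pos = 0
    while True:
        # Each chunk begins with "<hex size>[;extensions]\r\n".
        eol = body.find(b"\r\n", pos)
        if eol == -1:
            raise ValueError("missing chunk-size line")
        size_field = body[pos:eol].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        pos = eol + 2

        # A zero-size chunk marks the end of the body.
        if size == 0:
            break

        chunk = body[pos:pos + size]
        if len(chunk) != size:
            raise ValueError("truncated chunk")
        out += chunk
        pos += size

        # The chunk data itself is terminated by CRLF.
        if body[pos:pos + 2] != b"\r\n":
            raise ValueError("chunk not terminated by CRLF")
        pos += 2
    return bytes(out)
```

A client that treats the size lines as content, or that waits for more data after the zero-size chunk, will see garbage or stall, which is one way a spider could become slow on some servers.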
Old 09-02-2004, 01:19 AM   #5
renehaentjens
I've started to implement the requested API myself. If I can make it work with a reasonable amount of effort, I'll report back here.
Old 09-08-2004, 04:52 AM   #6
renehaentjens
See http://www.phpdig.net/forum/showthread.php?p=5644
Old 09-08-2004, 11:58 PM   #7
renehaentjens
The reason why it took the spider 30 minutes to index 90 relatively short pages is probably not chunk encoding. I have now discovered that Apache wrote about 10 MB of access log and 9 MB of error log during that half hour. The spider is making zillions of requests for "funny" kinds of URLs the whole time (see below).

Any idea what happened, Charter?

Code:
157.193.197.26 - - [25/Aug/2004:11:31:51 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/index.php?sid=pptsl092&thumb=pptsl092_t.jpg HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:31:51 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/index.php?sid=pptsl093&thumb=pptsl093_t.jpg HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:31:51 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl001.jpg">/ HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:11:31:51 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl001.jpg"> HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:11:31:51 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl001_t.jpg"%20/>/ HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:11:31:51 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl001_t.jpg"%20/> HTTP/1.1" 404 0
...
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl093.jpg"%20/>/ HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl093.jpg"%20/> HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/index.php?sid=pptsl008&thumb=pptsl008_t.jpg HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "GET /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/index.php?sid=pptsl008&thumb=pptsl008_t.jpg HTTP/1.1" 200 8087
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/claroline/css/default.css HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/index.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/claroline/auth/courses.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/claroline/auth/profile.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/index.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:47 +0200] "HEAD /dokeos/VELODLA/index.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:47 +0200] "HEAD /dokeos/index.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:47 +0200] "HEAD /dokeos/VELODLA/index.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:47 +0200] "HEAD /dokeos/VELODLA/index.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:47 +0200] "HEAD /dokeos/claroline/document/document.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:47 +0200] "HEAD /dokeos/claroline/scorm/scormdocument.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:48 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl001.jpg">/ HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:11:33:48 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl001.jpg"> HTTP/1.1" 404 0
...
157.193.197.26 - - [25/Aug/2004:12:00:43 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl093.jpg"%20/>/ HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:12:00:43 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl093.jpg"%20/> HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:12:00:43 +0200] "POST /dokeos/VELODLA/183phpdig/admin/spider.php HTTP/1.1" 200 48484
157.193.197.26 - - [25/Aug/2004:12:01:58 +0200] "GET /dokeos/VELODLA/183phpdig/admin/index.php HTTP/1.1" 200 4427

Error log:
[Wed Aug 25 11:31:32 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/robots.txt
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl001.jpg">/
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl001.jpg">
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl001_t.jpg"/>/
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl001_t.jpg"/>
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl001.jpg"/>/
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl001.jpg"/>
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl002.jpg">/
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl002.jpg">
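For anyone wanting to quantify this kind of problem, here is a small Python sketch that tallies access-log lines by method and status, and counts requests whose path contains a quote or an angle bracket, as in the excerpt above. The names `summarize`, `LINE_RE` and `SUSPECT` are made up for this example, and the pattern only assumes the log layout shown in the excerpt.

```python
import re
from collections import Counter

# Matches the request, status and size fields of lines like those above.
# The path group is greedy so that quotes embedded in the path are kept.
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>.*) HTTP/[\d.]+" (?P<status>\d{3}) (?P<size>\S+)\s*$'
)

# Characters that should not appear unencoded in a request path.
SUSPECT = re.compile(r'["<>]')


def summarize(lines):
    """Return (requests per (method, status), suspect-path requests per status)."""
    by_status = Counter()
    suspect = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip "..." separators and other non-log lines
        by_status[(match.group("method"), match.group("status"))] += 1
        if SUSPECT.search(match.group("path")):
            suspect[match.group("status")] += 1
    return by_status, suspect
```

Run over the full access log, for example with `summarize(open("access.log"))` where the file name is a placeholder, the two counters show what share of the spider's requests went to malformed paths.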
Old 09-11-2004, 07:29 AM   #8
Charter
It could be JavaScript. At some point in the past, spaces and parentheses were allowed in links. See this post for where to remove such characters, but note that the backslashes in that post are no longer displayed correctly due to the vB upgrade.
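The general idea of rejecting such characters can be sketched as follows, in Python for illustration only. The character set, and the names `BAD_CHARS`, `is_plausible_link` and `filter_links`, are assumptions made for this example; this is not PhpDig's code and not the fix from the linked post.

```python
import re

# Characters that suggest a "link" was really a fragment of JavaScript
# or broken markup: whitespace, quotes, angle brackets, parentheses.
BAD_CHARS = re.compile(r'[\s"\'<>()]')


def is_plausible_link(href):
    """Return True if the candidate contains none of the suspect characters."""
    return bool(href) and BAD_CHARS.search(href) is None


def filter_links(hrefs):
    """Drop implausible candidates and duplicates, keeping first-seen order."""
    seen = set()
    kept = []
    for href in hrefs:
        if is_plausible_link(href) and href not in seen:
            seen.add(href)
            kept.append(href)
    return kept
```

Filtering before any request is made would avoid the HEAD requests that end in 404, and dropping duplicates would avoid asking for the same page repeatedly. The trade-off is that a legitimate URL with an unencoded space or parenthesis would also be skipped.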