PhpDig.net

Old 02-10-2004, 09:23 AM   #1
RPIEL
Green Mole
 
Join Date: Feb 2004
Posts: 7
Indexation problems

Hi,

I have run into three different problems:

1) Most of my URLs look like this:
http://resoform/services/index.php?s...a-liste&m=tous

The script that is called is always index.php.
For each page it finds, the spider also tries to index index.php without the query string, which makes no sense. The complete URIs are indexed correctly, but the URL "index.php" (without a query string) is also found each time and identified as a duplicate. Is it possible to change this?

2) I have some listing pages where navigation works through links between the pages of the list. Because links are only provided to the first, last, next and previous pages, the number of levels required to visit every item in the list may be very large (perhaps greater than the PhpDig limit). Is there a solution to this problem?

3) I used the PhpDig HTML comments to exclude and include parts of the HTML code. However, I saw that some links which should have been excluded were visited. That does not seem normal to me. Does the exclude comment stop both the content from being indexed and the links from being followed?
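
To put a number on question 2, here is a minimal sketch of how many spider levels such a list could need. It assumes a hypothetical list of $n pages in which every page links only to the first, previous, next and last page, and that the spider starts on page 1. The function name levelsNeeded is made up for this illustration and is not part of PhpDig.

```php
<?php
// Hypothetical model: pages 1..$n, each linking to first, previous, next and last.
// Returns the number of link hops (levels) needed to reach every page from page 1.
function levelsNeeded(int $n): int
{
    $depth = [1 => 0];   // page number => hops from the start page
    $queue = [1];        // breadth-first search frontier
    while ($queue) {
        $page = array_shift($queue);
        $neighbours = [1, max(1, $page - 1), min($n, $page + 1), $n];
        foreach ($neighbours as $next) {
            if (!isset($depth[$next])) {
                $depth[$next] = $depth[$page] + 1;
                $queue[] = $next;
            }
        }
    }
    return max($depth);
}

echo levelsNeeded(40), "\n"; // prints 20
```

Under these assumptions the required depth grows as about half the number of list pages, because the spider can work inwards from both the first and the last page.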

Thanks for your help, and sorry for asking three questions at once.

Régis

Old 02-11-2004, 01:58 AM   #2
tomas
Orange Mole
 
Join Date: Feb 2004
Posts: 47
Hello RPIEL,

Try:
1) config.php, line 97: define('PHPDIG_DEFAULT_INDEX',false);
Change false to true.
2) config.php, lines 84-86: define('SPIDER_MAX_LIMIT',100);
define('SPIDER_DEFAULT_LIMIT',100);
define('RESPIDER_LIMIT',100);
Set each limit to 100 or more, for example.
3) You have to use the expressions set on lines 92 and 94 of
config.php; the defaults are <!-- phpdigExclude --> and <!-- phpdigInclude -->.
In your HTML code, put each comment on a line of its own, e.g.:
<html>
<body>
text to be searched....
<!-- phpdigExclude -->
text not to be searched....
<!-- phpdigInclude -->
......
</body>
</html>

hope this helps a little
tomas
Old 02-11-2004, 04:49 AM   #3
RPIEL
Green Mole
 
Join Date: Feb 2004
Posts: 7
Hello Tomas,

Thank you for your answer.

1) Setting PHPDIG_DEFAULT_INDEX to true didn't solve my problem: now the URL http://resodorm/services/ is indexed x times.
This takes a few seconds each time, after which the spider sees that it is a duplicate. In any case, in my system the script "index.php" isn't a page by itself; it only dynamically includes other scripts and templates. It doesn't make sense to crawl it.

2) OK, setting the limit very high seems to be a solution.
However, when I do this the same pages are indexed many times, and the whole process takes several hours when it should take about 15 minutes...
I think a more reliable solution would be to maintain (or generate) a page containing all the links to index. This might solve the problem of needing many levels to get through the items of a list.

3) I did use <!-- phpdigExclude --><!-- phpdigInclude --> comments in my pages, but sometimes the result was not what I expected.

sincerely

Régis
Old 02-11-2004, 05:24 AM   #4
RPIEL
Green Mole
 
Join Date: Feb 2004
Posts: 7
About the <!-- phpdigExclude --> and <!-- phpdigInclude --> comments:

They appear to work for controlling whether the content of a page is indexed: the words in the excluded parts of the document are not indexed.
However, it seems to me that the links in these parts are still followed. That is exactly what I want to avoid!
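
To make the behaviour I am hoping for concrete, here is a rough sketch. It is not PhpDig's actual parser: the function name linksOutsideExcluded is hypothetical, it assumes each marker comment sits on a line of its own, and it only looks at double-quoted href attributes.

```php
<?php
// Illustrative only: collects href values, skipping everything between
// <!-- phpdigExclude --> and <!-- phpdigInclude --> marker lines.
function linksOutsideExcluded(string $html): array
{
    $including = true;
    $links = [];
    foreach (preg_split('/\R/', $html) as $line) {
        if (strpos($line, '<!-- phpdigExclude -->') !== false) {
            $including = false;   // stop collecting links
            continue;
        }
        if (strpos($line, '<!-- phpdigInclude -->') !== false) {
            $including = true;    // resume collecting links
            continue;
        }
        if ($including && preg_match_all('/href="([^"]*)"/i', $line, $m)) {
            foreach ($m[1] as $href) {
                $links[] = $href;
            }
        }
    }
    return $links;
}

$html = "<a href=\"a.php\">A</a>\n"
      . "<!-- phpdigExclude -->\n"
      . "<a href=\"menu.php\">Menu</a>\n"
      . "<!-- phpdigInclude -->\n"
      . "<a href=\"b.php\">B</a>";
print_r(linksOutsideExcluded($html)); // a.php and b.php, but not menu.php
```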

Has anyone dealt with this issue?

Thanks in advance,

Régis
Old 02-11-2004, 12:44 PM   #5
Charter
Head Mole
 
Charter's Avatar
 
Join Date: May 2003
Posts: 2,539
Hi. For one, try the code in this post but replace:
PHP Code:
//exclude if specific variable set
if (strpos($link['file'],'print=y')) {
$link['ok'] = 0;

with the following:
PHP Code:
//exclude if specific link (dot escaped so it only matches a literal ".")
if (eregi('index\.php$',$link['file'])) {
$link['ok'] = 0;
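
As a quick way to check which links that rule would catch, here is a standalone sketch. It uses preg_match, since eregi was removed in PHP 7. The helper name isBareIndex and the sample paths are made up for illustration, and the sketch assumes $link['file'] holds the path together with any query string.

```php
<?php
// Hypothetical helper mirroring the "$link['ok'] = 0" rule above.
function isBareIndex(string $file): bool
{
    // Case-insensitive match on a path that ends in "index.php",
    // i.e. one with no query string after it.
    return preg_match('/index\.php$/i', $file) === 1;
}

var_dump(isBareIndex('services/index.php'));        // bool(true)
var_dump(isBareIndex('services/index.php?m=tous')); // bool(false)
```

Only the bare index.php is flagged, so the complete URLs with a query string would still be indexed.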

For two, it's probably faster to index your site in pieces, or start the index process using a different page.

For three, this thread may help.
Old 02-11-2004, 11:52 PM   #6
RPIEL
Green Mole
 
Join Date: Feb 2004
Posts: 7
Hi Charter,

Thank you for your answer.
I got the best results by building a special page for the indexing process.

This page has a robots meta tag with "noindex, follow" as its content.

Because the parts of the site I want to index can be retrieved with a simple query on my database, building this special page was easy.
Now I only need a depth of 1 for the spidering process, and it runs very fast.
All the pages I want indexed are retrieved just once.
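
For anyone who wants to try the same approach, a page like this could be generated roughly as follows. This is only a sketch: the function name buildSeedPage and the example URLs are hypothetical, and in practice the list of URLs would come from your own database query.

```php
<?php
// Builds a link-only page that asks robots to follow its links
// without indexing the page itself.
function buildSeedPage(array $urls): string
{
    $items = '';
    foreach ($urls as $url) {
        // Escape so that "&" in query strings stays valid HTML.
        $safe = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
        $items .= "  <li><a href=\"$safe\">$safe</a></li>\n";
    }
    return "<html>\n<head>\n"
         . "<meta name=\"robots\" content=\"noindex, follow\">\n"
         . "</head>\n<body>\n<ul>\n$items</ul>\n</body>\n</html>\n";
}

// Example URLs are placeholders, not the real site structure.
echo buildSeedPage(['index.php?s=a&m=tous', 'index.php?s=b&m=tous']);
```

Pointing the spider at this page with a depth of 1 means each listed URL is fetched once.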

I think that, whenever possible, building special pages for indexing the site is a good solution. The crawl path is simpler, and it avoids fetching a lot of pages just to find out that they are duplicates.

For the third point, I still need to run some tests.

Sincerely,

Régis