Indexation problems

RPIEL · 02-10-2004, 09:23 AM

Hi,

I encounter 3 different problems:

1) Most of my urls look like this :
http://resoform/services/index.php?s...a-liste&m=tous

the script which is called is always index.php.
For each page found, the spider also tries to index index.php without the query string, which makes no sense. The complete URIs are indexed correctly, but the URL "index.php" (without query string) is also found each time and identified as a duplicate. Is it possible to change this?

2) I have some listing pages where navigation works through links between the pages. Because links are provided only to the first, last, next or previous page, the number of levels required to visit all the items of the list may be very large (perhaps greater than the phpDig limit). Is there a solution to this problem?

3) I used phpDig HTML comments to exclude and include some parts of the HTML code. However, I saw that some links which should have been excluded were visited. That does not seem normal to me. Does the exclude comment stop both the content being indexed and the links being followed?

Thanks for your help, and sorry to ask 3 questions at a time

Régis
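
To illustrate question 2, here is a small sketch in PHP. It assumes a hypothetical list of N pages in which each page links only to the first, previous, next and last page, and that the spider starts on page 1. The function name `max_crawl_depth` is made up for this example and is not part of phpDig. It walks the pages breadth-first and reports how many link levels are needed to reach the farthest page.

```php
<?php
// Hypothetical model, not phpDig code: a list split into $pageCount pages,
// where each page links only to the first, previous, next and last page.
function max_crawl_depth(int $pageCount, int $startPage = 1): int
{
    $depth = [$startPage => 0];   // page number => links followed to reach it
    $queue = [$startPage];

    // Breadth-first walk: the first time a page is seen is its shortest path.
    while ($queue) {
        $page = array_shift($queue);
        $neighbours = [1, $page - 1, $page + 1, $pageCount];
        foreach ($neighbours as $next) {
            if ($next < 1 || $next > $pageCount || isset($depth[$next])) {
                continue;   // outside the list, or already reached
            }
            $depth[$next] = $depth[$page] + 1;
            $queue[] = $next;
        }
    }

    return max($depth);
}

foreach ([10, 50, 200] as $pages) {
    printf("%d pages: depth %d\n", $pages, max_crawl_depth($pages));
}
```

Under these assumptions the required depth is roughly half the number of pages, so a long list can easily exceed a fixed spider limit.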

tomas · 02-11-2004, 01:58 AM

hello rpiel,

try:
1) config.php line 97: define('PHPDIG_DEFAULT_INDEX',false);
set false to true
2) config.php lines 84-86: define('SPIDER_MAX_LIMIT',100);
define('SPIDER_DEFAULT_LIMIT',100);
define('RESPIDER_LIMIT',100);
set the limits to e.g. 100 or more
3) you have to use the expressions set in lines 92 and 94 of
config.php: default <!-- phpdigExclude --> and <!-- phpdigInclude -->
in your html code, put them on lines of their own, e.g.
<html>
<body>
text to be searched....
<!-- phpdigExclude -->
text not to be searched....
<!-- phpdigInclude -->
......
</body>
</html>

hope this helps a little
tomas

RPIEL · 02-11-2004, 04:49 AM

Hello Tomas,

Thank you for your answer.

1) Setting PHPDIG_DEFAULT_INDEX to true didn't solve my problem: now the URL http://resodorm/services/ is indexed x times.
This takes a few seconds each time, after which the spider sees that it is a duplicate. Anyway, in my system the script "index.php" isn't a page by itself; it only works through dynamic inclusion of other scripts and templates. It doesn't make sense to crawl it.

2) OK, setting the limit very high seems to be a solution.
However, when I do this the same pages are indexed many times and the whole process takes several hours when it should take about 15 minutes...
I think it would be a more reliable solution to maintain (or compute) a page containing all the links to index. This might solve the problem of requiring multiple levels to go through the items of a list.

3) I did use <!-- phpdigExclude --> and <!-- phpdigInclude --> comments in my pages, but sometimes the result was not what I expected.

sincerely

Régis

RPIEL · 02-11-2004, 05:24 AM

About <!-- phpdigExclude --> and <!-- phpdigInclude --> comments:

They appear to work for indexing or not indexing the content of a page. The words in excluded parts of the document are not indexed.
However, it seems to me that the links in these parts are still followed. That is exactly what I want to avoid!

Has anyone dealt with this issue?

Thanks in advance,

Régis
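
The behaviour asked for here, ignoring links inside excluded regions, can be sketched as follows. This is an illustration only and not phpDig's implementation: the helper names `strip_excluded` and `extract_links` are invented for the example, and the link regex is deliberately simplistic. The idea is to drop everything between an exclude marker and the next include marker before looking for links.

```php
<?php
// Illustrative only: NOT phpDig's implementation.
function strip_excluded(string $html): string
{
    // Remove everything from an exclude marker up to the next include marker
    // (or to the end of the document if no include marker follows).
    $pattern = '/<!--\s*phpdigExclude\s*-->.*?(?:<!--\s*phpdigInclude\s*-->|\z)/s';
    return preg_replace($pattern, '', $html) ?? $html;
}

function extract_links(string $html): array
{
    // Simplistic href extraction, good enough for a demonstration.
    preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $html, $matches);
    return $matches[1];
}

$page = <<<'HTML'
<a href="keep.php">kept</a>
<!-- phpdigExclude -->
<a href="menu.php">navigation</a>
<!-- phpdigInclude -->
<a href="also-keep.php">kept too</a>
HTML;

// Only links outside the excluded region remain.
print_r(extract_links(strip_excluded($page)));
```

With a filter like this applied before link extraction, the link to menu.php in the sample page would never be queued for spidering.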

Charter · 02-11-2004, 12:44 PM

Hi. For one, try the code in this post but replace:

PHP Code:

//exclude if specific variable set
if (strpos($link['file'],'print=y')) {
    $link['ok'] = 0;
}

with the following:

PHP Code:

//exclude if specific link
if (eregi('index.php$',$link['file'])) {
    $link['ok'] = 0;
}

For two, it's probably faster to index your site in pieces, or start the index process using a different page.

For three, this thread may help.
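
For readers on current PHP: `eregi()` was removed in PHP 7.0, so the same check can be written with `preg_match()`. The sketch below wraps it in a helper named `exclude_bare_index`, which is invented for this example; only the `$link['file']` and `$link['ok']` keys come from the snippets above. It assumes, as the `print=y` check suggests, that `$link['file']` includes the query string. The dot is also escaped here so that it matches a literal "." only.

```php
<?php
// Hypothetical helper; only the 'file' and 'ok' keys come from the thread.
function exclude_bare_index(array $link): array
{
    // Case-insensitive match on a value that ends in "index.php". A URL with
    // a query string after index.php does not end there, so it is left alone.
    if (preg_match('/index\.php$/i', $link['file'])) {
        $link['ok'] = 0;
    }
    return $link;
}

$samples = [
    ['file' => 'index.php',        'ok' => 1],   // bare script: excluded
    ['file' => 'index.php?m=tous', 'ok' => 1],   // has a query string: kept
];

foreach ($samples as $sample) {
    $result = exclude_bare_index($sample);
    printf("%s => ok=%d\n", $result['file'], $result['ok']);
}
```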

RPIEL · 02-11-2004, 11:52 PM

Hi Charter,

Thank you for your answer.
I get the best results by building a special page for the indexing process.

This page has a robots meta tag with "noindex, follow" content.

Because the parts of the site I want to index can be retrieved by a simple query on my database, building this special page was easy.
Now I only need a depth of 1 for the spidering process and it runs very fast.
All the pages I want indexed are retrieved just once.

I think it is always a good solution (when possible) to build special pages in order to index the site. The path can be simplified, and this avoids testing a lot of pages to see whether they are duplicates.

For three, I have to run some tests.

Sincerely,

Régis
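
A special page of this kind could be generated along the following lines. This is a minimal sketch, not the poster's actual script: `build_seed_page` is a hypothetical name, and the URL list stands in for the result of the database query mentioned above. The page carries the "noindex, follow" robots meta tag and one link per unique URL, so a spider depth of 1 is enough to reach everything.

```php
<?php
// Hypothetical helper: $urls would come from the database query in practice.
function build_seed_page(array $urls): string
{
    $items = '';
    foreach (array_unique($urls) as $url) {
        // Escape the URL so characters such as & are valid inside HTML.
        $safe = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
        $items .= "  <li><a href=\"$safe\">$safe</a></li>\n";
    }

    return "<html>\n<head>\n"
         . "<meta name=\"robots\" content=\"noindex, follow\">\n"
         . "</head>\n<body>\n<ul>\n$items</ul>\n</body>\n</html>\n";
}

// Made-up example URLs; the duplicate is listed only once in the output.
echo build_seed_page([
    'index.php?id=1',
    'index.php?id=2',
    'index.php?id=1',
]);
```

Because duplicates are removed when the page is built, the spider has no repeated URLs to fetch and compare.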
