|
10-18-2003, 06:01 AM | #1 |
Orange Mole
Join Date: Oct 2003
Location: hamburg, germany
Posts: 52
|
exclude doesn't really work?
hi!
i just found some text on my site that shouldn't be indexed. i put it between <!-- phpdigExclude --> and <!-- phpdigInclude --> and then reindexed the page. but a search for that text still finds the page, although the text should have been excluded. does anyone know this problem? |
10-18-2003, 07:40 PM | #2 |
Purple Mole
Join Date: Sep 2003
Location: Kassel, Germany
Posts: 119
|
Try this:
In the file robot_functions.php, add the instruction "continue;" at line #777. Read this thread: PHP Code:
|
10-19-2003, 06:30 AM | #3 |
Green Mole
Join Date: Oct 2003
Posts: 2
|
I have the same problem:
Excluded content has been indexed. The "continue"-bug is fixed. I deleted and reinstalled the whole phpdig database. I excluded this example to test the exclude function: PHP Code:
Maybe someone can help me, thanks, Holger |
10-19-2003, 06:44 AM | #4 |
Orange Mole
Join Date: Oct 2003
Location: hamburg, germany
Posts: 52
|
same problem with me, the "continue"-thing doesn't seem to fix it.
|
10-19-2003, 07:28 AM | #5 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
PHP Code:
Instead of having the PhpDig exclude and include comments on one line like so: Code:
<!-- phpdigExclude -->some stuff<!-- phpdigInclude -->
try putting each comment on a line by itself like so: Code:
<!-- phpdigExclude -->
some stuff
<!-- phpdigInclude -->
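To illustrate why the line layout might matter, here is a minimal sketch. It is not PhpDig's actual code: the helper name filterExcluded is made up, and it assumes the spider compares each trimmed line against the marker, so a marker only counts when it is alone on its line.

```php
<?php
// Hypothetical helper, not PhpDig's actual code: drops every line between
// an exclude marker and the next include marker. A marker only counts when
// it is the sole content of its line (after trimming whitespace).
function filterExcluded(array $lines, string $exclude, string $include): array
{
    $kept = [];
    $skipping = false;
    foreach ($lines as $line) {
        $trimmed = trim($line);
        if ($trimmed === $exclude) {
            $skipping = true;   // start skipping, and drop the marker line itself
            continue;
        }
        if ($trimmed === $include) {
            $skipping = false;  // resume keeping lines after the marker
            continue;
        }
        if (!$skipping) {
            $kept[] = $line;
        }
    }
    return $kept;
}

$exclude = '<!-- phpdigExclude -->';
$include = '<!-- phpdigInclude -->';

// Markers on their own lines: "secret" is dropped.
$separate = ['visible', $exclude, 'secret', $include, 'also visible'];
print_r(filterExcluded($separate, $exclude, $include));

// Markers sharing a line with content: no line equals a marker,
// so the whole line, including "secret", is kept.
$sameLine = ['visible', $exclude . 'secret' . $include, 'also visible'];
print_r(filterExcluded($sameLine, $exclude, $include));
```

Under that assumption, the one-line layout never toggles the skipping state, which would explain excluded text still turning up in the index.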
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
10-19-2003, 07:36 AM | #6 |
Orange Mole
Join Date: Oct 2003
Location: hamburg, germany
Posts: 52
|
they are not on one line on my site. that can't be the reason.
|
10-19-2003, 08:51 AM | #7 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Can you check to make sure that your phpdigTestUrl function in robot_functions.php is the same as the one that comes with PhpDig version 1.6.2?
10-19-2003, 09:12 AM | #8 |
Orange Mole
Join Date: Oct 2003
Location: hamburg, germany
Posts: 52
|
function phpdigTestUrl($url,$mode='simple',$cookies=array()) {
    $components = parse_url($url);
    $lm_date = '';
    $status = 'NOFILE';
    $auth_string = '';
    $redirs = 0;
    $stop = false;

    if (isset($components['host'])) {
        $host = $components["host"];
        if (isset($components['user']) && isset($components['pass']) && $components['user'] && $components['pass']) {
            $auth_string = 'Authorization: Basic '.base64_encode($components['user'].':'.$components['pass'])."\n";
        }
    } else {
        $host = '';
    }
    if (isset($components['port'])) {
        $port = (int)$components["port"];
    } else {
        $port = 80;
    }
    if (isset($components['path'])) {
        $path = $components["path"];
    } else {
        $path = '';
    }
    if (isset($components['query'])) {
        $query = $components["query"];
    } else {
        $query = '';
    }

    $fp = @fsockopen($host,$port);

    if ($port != 80) {
        $sport = ":".$port;
    } else {
        $sport = "";
    }

    if (!$fp) {
        //host domain not found
        $status = "NOHOST";
    } else {
        if ($query) {
            $path .= "?".$query;
        }
        $cookiesSendString = phpDigMakeCookies($cookies,$path);

        //complete get
        $request = "HEAD $path HTTP/1.1\n"
            ."Host: $host$sport\n"
            .$cookiesSendString
            .$auth_string
            ."Accept: */*\n"
            ."Accept-Charset: ".PHPDIG_ENCODING."\n"
            ."Accept-Encoding: identity\n"
            ."User-Agent: PhpDig/".PHPDIG_VERSION." (PHP; MySql)\n\n";

        fputs($fp,$request);

        //test return code
        while (!$stop && !feof($fp)) {
            $answer = fgets($fp,8192);
            //print $answer."<br>\n";
            if (isset($req1) && $req1) {
                //close, and open a new connection
                //on the new location
                fclose($fp);
                $fp = fsockopen($host,$port);
                if (!$fp) {
                    //host domain not found
                    $status = "NOHOST";
                    break;
                } else {
                    fputs($fp,$req1);
                    unset($req1);
                    $answer = fgets($fp,8192);
                }
            }

            if (ereg("HTTP/[0-9.]+ (([0-9])[0-9]{2})", $answer,$regs)) {
                if ($regs[2] == 2 || $regs[2] == 3) {
                    $code = $regs[2];
                } elseif ($regs[1] >= 401 && $regs[1] <= 403) {
                    $status = "UNAUTH";
                    break;
                } else {
                    $status = "NOFILE";
                    break;
                }
            } else if (eregi("^ *location: *(.*)",$answer,$regs) && $code == 3) {
                if ($redirs > 4) {
                    $stop = true;
                    $status = "LOOP";
                }
                $newpath = trim($regs[1]);
                $newurl = parse_url($newpath);
                //search if relocation is absolute or relative
                if (!isset($newurl["host"]) && isset($newurl["path"]) && !ereg('^/',$newurl["path"])) {
                    $path = dirname($path).'/'.$newurl["path"];
                } else {
                    $path = $newurl["path"];
                }
                if (!isset($newurl['host']) || !$newurl['host'] || $host == $newurl['host']) {
                    $cookiesSendString = phpDigMakeCookies($cookies,$path);
                    $req1 = "HEAD $path HTTP/1.1\n"
                        ."Host: $host$sport\n"
                        .$cookiesSendString
                        .$auth_string
                        ."Accept: */*\n"
                        ."Accept-Charset: ".PHPDIG_ENCODING."\n"
                        ."Accept-Encoding: identity\n"
                        ."User-Agent: PhpDig/".PHPDIG_VERSION." (PHP; MySql)\n\n";
                } else {
                    $stop = true;
                    $status = "NEWHOST";
                    $host = $newurl['host'];
                }
            }
            //parse cookies
            elseif (eregi("Set-Cookie: *(([^=]+)=[^; ]+) *(; *path=([^; ]+))* *(; *domain=([^; ]+))*",$answer,$regs)) {
                $cookies[$regs[2]] = array('string'=>$regs[1],'path'=>$regs[4],'domain'=>$regs[6]);
            }
            //Parse content-type header
            elseif (eregi("Content-Type: *([a-z]+)/([a-z.-]+)",$answer,$regs)) {
                if ($regs[1] == "text") {
                    switch ($regs[2]) {
                        case 'plain':
                            $status = 'PLAINTEXT';
                            break;
                        case 'html':
                            $status = 'HTML';
                            break;
                        default :
                            $status = "NOFILE";
                            $stop = true;
                    }
                } else if ($regs[1] == "application") {
                    if ($regs[2] == 'msword' && PHPDIG_INDEX_MSWORD == true) {
                        $status = "MSWORD";
                    } else if ($regs[2] == 'pdf' && PHPDIG_INDEX_PDF == true) {
                        $status = "PDF";
                    } else if ($regs[2] == 'vnd.ms-excel' && PHPDIG_INDEX_MSEXCEL == true) {
                        $status = "MSEXCEL";
                    } else {
                        $status = "NOFILE";
                        $stop = true;
                    }
                } else {
                    $status = "NOFILE";
                    $stop = true;
                }
            } elseif (eregi('Last-Modified: *([a-z0-9,: ]+)',$answer,$regs)) {
                //search last-modified header
                $lm_date = $regs[1];
            }
            if (!eregi('[a-z0-9]+',$answer)) {
                $stop = true;
            }
        }
        @fclose($fp);
    }

    //returns variable or array
    if ($mode == 'date') {
        return compact('status', 'lm_date', 'path', 'host', 'cookies');
    } else {
        return $status;
    }
}
that's it. and i haven't changed anything on it, so i guess it should be the correct one. |
10-19-2003, 11:58 AM | #9 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Yep, that looks correct. Let's try echoing out some stuff.
In robot_functions.php, right before: PHP Code:
PHP Code:
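As a rough idea of what this kind of echo tracing can look like outside the spider, here is a standalone sketch. It is not PhpDig's code: traceMarkers is a made-up name, the sample page is invented, and it assumes a marker is only recognised when it is the only thing on its line.

```php
<?php
// Hypothetical tracer, not PhpDig's code: returns one diagnostic message
// per line that mentions a marker, so you can see which ones would be
// recognised and which ones share their line with other content.
function traceMarkers(array $lines, string $exclude, string $include): array
{
    $messages = [];
    foreach ($lines as $number => $line) {
        $trimmed = trim($line);
        if ($trimmed === $exclude) {
            $messages[] = 'line ' . ($number + 1) . ': exclude comment found';
        } elseif ($trimmed === $include) {
            $messages[] = 'line ' . ($number + 1) . ': include comment found';
        } elseif (strpos($line, $exclude) !== false || strpos($line, $include) !== false) {
            $messages[] = 'line ' . ($number + 1) . ': marker present but not alone on the line';
        }
    }
    return $messages;
}

// Invented sample page: the exclude marker shares its line with a <td> tag.
$page = [
    '<p>intro</p>',
    '<td><!-- phpdigExclude -->',
    'private text',
    '<!-- phpdigInclude -->',
];
echo implode("\n", traceMarkers($page, '<!-- phpdigExclude -->', '<!-- phpdigInclude -->')), "\n";
```

For this sample only the include comment is reported as found, and the exclude marker is flagged as sharing its line with other content.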
10-19-2003, 02:41 PM | #10 |
Orange Mole
Join Date: Oct 2003
Location: hamburg, germany
Posts: 52
|
okay. that's what i got:
Content type is: HTML. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. PhpDig include comment found. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. does that help you anything? |
10-19-2003, 02:57 PM | #11 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
The results of the echo test tell me that the PhpDig include comment is being found, but that the PhpDig exclude comment before that is not being found.
Assuming that the PhpDig exclude comment is on one line by itself, maybe there is a typo in the config file. Can you check what PHPDIG_EXCLUDE_COMMENT is set to in the config file? Does this match what is being used in the files that you want to crawl?
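One way to check for such a mismatch is to list every line of the page that mentions the marker keyword without being exactly equal to the configured value. This is only a sketch: findNearMisses is a hypothetical helper, and the value shown for PHPDIG_EXCLUDE_COMMENT is assumed from the comments used earlier in this thread, so compare it against your own config file.

```php
<?php
// Hypothetical check, not part of PhpDig: reports lines that mention the
// marker keyword (case-insensitively) but whose trimmed content is not
// exactly equal to the configured comment.
function findNearMisses(array $lines, string $configured, string $keyword): array
{
    $misses = [];
    foreach ($lines as $number => $line) {
        $trimmed = trim($line);
        if (stripos($trimmed, $keyword) !== false && $trimmed !== $configured) {
            // bin2hex exposes invisible bytes such as non-breaking spaces.
            $misses[$number + 1] = bin2hex($trimmed);
        }
    }
    return $misses;
}

$configured = '<!-- phpdigExclude -->'; // assumed value of PHPDIG_EXCLUDE_COMMENT
$page = [
    '<!-- phpdigExclude -->',   // exact match, not reported
    '<!--phpdigExclude-->',     // missing spaces, reported
    '<!-- phpdigexclude -->',   // wrong case, reported
];
print_r(array_keys(findNearMisses($page, $configured, 'phpdigExclude')));
```

The hex dump stored for each reported line makes it easier to spot characters that look like ordinary spaces in an editor but are not.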
10-19-2003, 03:27 PM | #12 |
Orange Mole
Join Date: Oct 2003
Location: hamburg, germany
Posts: 52
|
yeah i checked that already. there's no mistake in this.
and the other funny thing is: if it would really be this way, that it only finds the exclude but afterwards not the include, then why has it indexed anything at all? strange. |
10-19-2003, 06:41 PM | #13 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
This is strange. It finds the include, so things should get indexed, but it doesn't find the exclude. With exclude before include, and each on their own line, it seems that:
PHP Code:
PHP Code:
PHP Code:
10-19-2003, 07:02 PM | #14 |
Orange Mole
Join Date: Oct 2003
Location: hamburg, germany
Posts: 52
|
same thing again. it only finds the include. i'd mistaken that in my last post, so it's actually not that strange, because it does only find the include and that's already a good explanation for why it doesn't exclude what i want to get excluded.
and yes, the include- and exclude-comments are not in the same line. now i tried another thing. i made a file test.php with nothing but <!-- phpdigExclude --> word <!-- phpdigInclude --> in it. and now the result of the spider: Content type is HTML and PhpDig exclude comment found. Content type is: HTML. PhpDig include comment found. now that seems to be the way it should be - right? the one i tried before was quite a big page. and something in that must have caused the error. now that could become quite a needle in a haystack... any ideas? wanna see the html? |
10-19-2003, 07:40 PM | #15 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Yep, that test.php file is the way it should be, so maybe it is something with the big page like you say. You are right that the include and exclude should not be on the same line.
Also, the include and exclude need to be on lines by themselves. Maybe try editing the big page in a text only editor to make sure that the include and exclude comments are on lines by themselves and no soft wrapping is going on there. If you want, just post a snippet of the html around the exclude comment, like +/- 10 or so lines.