obottek
08-24-2006, 06:30 AM
In my vanilla installation of 1.8.9 RC1, indexing certain sites and pages fails, as does indexing all PDFs and other files handled through the external binaries.
Situation
You might see one or more of the following:
_the site you submit for spidering does not contain any links and is not added (no "yes" or "+" in front of your first page), therefore site indexing stops
_certain pages are listed when spidering but do not contain links and are not added (no "yes" or "+" in front of the page), so links from these pages are skipped
_PDFs, DOCs, XLSs, ... are listed when spidering but are not added (no "yes" or "+" in front of the document)
Problem
In my view, the problem can be found in the function phpdigMakeUTF8 in robot_functions.php (starting around line 1735).
This function takes the content of a page or (converted) file and tries to convert its characters to UTF-8. To do so, it performs a few checks to determine the current charset, and this is where the problem lies. For all documents, HTML pages and converted files alike, it looks for a "content-type" meta tag inside the "head" tags. It should look like this:
<head>
...
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
...
</head>
If no such meta tag is found inside the "head" tags, the function actually returns no content at all to the spider. As a result, the document is neither indexed nor saved to the database, and of course no links can be extracted from it either. It is pretty clear that PDFs and other non-HTML files will never contain a "head" section or any meta tags, so they are skipped entirely.
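For illustration, this is roughly the kind of check involved, as a minimal hypothetical sketch (findMetaCharset is a name I made up, not an actual phpdig function):

```php
<?php
// Hypothetical sketch of the kind of check phpdigMakeUTF8 performs:
// look for a charset declaration in a content-type meta tag.
function findMetaCharset($html) {
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([a-zA-Z0-9_-]+)/i', $html, $m)) {
        return strtolower($m[1]);
    }
    return false; // PDFs, plain text etc. always end up here
}
```

A PDF converted to text will never match such a pattern, which is exactly the case where the original function then returns nothing.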
Workaround
So if the expected meta tag is not found, the function should at least return the unconverted string instead of nothing. I haven't looked into it too much; maybe the charset detection function should also be allowed to override the result when the meta tag is missing.
Note, however, that this workaround treats all pages and files without the above-named meta tag as UTF-8 encoded. For PDFs, make sure you have the '-enc UTF-8' option of pdftotext enabled.
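For reference, that option is passed on the pdftotext command line like this (the file names here are only placeholders):

```shell
# Convert a PDF to UTF-8 text so the workaround's UTF-8 assumption holds.
# document.pdf / document.txt are placeholder names.
pdftotext -enc UTF-8 document.pdf document.txt
```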
In robot_functions.php, around line 1832, replace:
if ($no_convert == 0) {
    if (ENABLE_JPKANA == true) {
        $string = @mb_convert_kana($string,CONVERT_JPKANA,"UTF-8");
    }
    return $string;
}
else {
    return 0;
}
with:
if ($no_convert == 0) {
    if (ENABLE_JPKANA == true) {
        $string = @mb_convert_kana($string,CONVERT_JPKANA,"UTF-8");
    }
    return $string;
}
else {
    return $string;
}
It actually only replaces return 0; with return $string;.
Improvements
If someone has a patch that actually allows the charset detection to override the "unknown charset" result, please add it here. That would turn this workaround into a proper fix. ;)
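As a starting point, here is a hedged sketch of what such a fallback might look like, assuming the mbstring extension is available (phpdigGuessCharset is a name I made up, and the encoding list is an assumption that would need tuning for your content):

```php
<?php
// Sketch: instead of giving up when no meta tag is found, try to
// detect the charset from the bytes themselves and convert to UTF-8.
function phpdigGuessCharset($string) {
    // Strict detection against a list of likely encodings; order matters,
    // and ISO-8859-1 goes last because it accepts any byte sequence.
    $charset = mb_detect_encoding($string, "UTF-8,EUC-JP,SJIS,ISO-8859-1", true);
    if ($charset === false) {
        // Detection failed: return the raw string rather than nothing,
        // mirroring the workaround above.
        return $string;
    }
    return mb_convert_encoding($string, "UTF-8", $charset);
}
```

This would keep the meta-tag path as-is and only kick in when no charset was declared.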
Greetings,
Olaf