08-24-2006, 05:30 AM | #1 |
Green Mole
Join Date: Sep 2003
Posts: 15
Certain sites, pages and pdfs are not indexed 1.8.9 RC1 [Workaround included]
In my vanilla installation of 1.8.9 RC1, indexing of certain sites and pages, and of all PDFs or other files converted through the external binaries, does not work.

Situation
You might see one or more of the following:
- The site you submit for spidering is reported as containing no links and is not added (no "yes" or "+" in front of your first page), so site indexing stops.
- Certain pages are listed when spidering but are reported as containing no links and are not added (no "yes" or "+" in front of the page); links from these pages are skipped.
- PDFs, DOCs, XLS files, ... are listed when spidering but are not added (no "yes" or "+" in front of the document).

Problem
In my view, the problem lies in the function phpdigMakeUTF8 in robot_functions.php (starting around line 1735). This function takes the content of a page or (converted) file and tries to convert it to UTF-8. To do so, it runs a few checks to find out the current charset, and this is where it goes wrong. For every document, HTML pages and converted files alike, it looks for a "Content-Type" meta tag inside the "head" tags. The tag should look like this:
Code:
<head>
...
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
...
</head>

Workaround
If the meta tag is not found, the function should at least return the unconverted string instead of nothing. I haven't looked into it too much. Maybe the charset detection function should also be allowed to override the result of the missing meta tag. Note that this workaround effectively treats all pages and files without the meta tag above as UTF-8 encoded. For PDFs, make sure you have the '-enc UTF-8' option of pdftotext enabled.

In robot_functions.php, around line 1832, replace:
Code:
if ($no_convert == 0) {
    if (ENABLE_JPKANA == true) {
        $string = @mb_convert_kana($string,CONVERT_JPKANA,"UTF-8");
    }
    return $string;
}
else {
    return 0;
}

with:
Code:
if ($no_convert == 0) {
    if (ENABLE_JPKANA == true) {
        $string = @mb_convert_kana($string,CONVERT_JPKANA,"UTF-8");
    }
    return $string;
}
else {
    return $string;
}

Improvements
If someone has a patch that actually lets the charset detection override the "unknown charset" result, please add it here. That would turn this workaround into a fix.

Greetings,
Olaf
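
As a possible starting point for the improvement asked for above, here is a minimal sketch of what the "else" branch could do instead of returning the raw string. It is not phpDig code: the function name fallbackToUtf8 and the ISO-8859-1 default are my own assumptions, and it only checks whether the text is already valid UTF-8 before converting, which is not real charset detection.

```php
<?php
// Hypothetical helper, NOT part of phpDig: used only when no Content-Type
// meta tag was found. The fallback charset is an assumption; adjust it to
// whatever your unlabelled pages typically use.
function fallbackToUtf8($string, $fallbackCharset = 'ISO-8859-1')
{
    // Without mbstring nothing can be checked, so behave like the
    // workaround and return the string unchanged.
    if (!function_exists('mb_check_encoding')) {
        return $string;
    }
    // Text that is already valid UTF-8 (plain ASCII included) passes through.
    if (mb_check_encoding($string, 'UTF-8')) {
        return $string;
    }
    // Otherwise assume the fallback charset and convert to UTF-8.
    return mb_convert_encoding($string, 'UTF-8', $fallbackCharset);
}
```

With something like this, the "else" branch would return fallbackToUtf8($string) rather than $string, so Latin-1 pages without the meta tag would no longer be stored as broken UTF-8. I have not tested it inside phpdigMakeUTF8 itself.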