Certain sites, pages and pdfs are not indexed 1.8.9 RC1 [Workaround included]

obottek · 08-24-2006, 06:30 AM

In my vanilla installation of 1.8.9 RC1, indexing of certain sites, pages, and all PDFs or other files that go through the external binaries does not work.

Situation
You might see one or more of the following:
_the site you submit for spidering appears to contain no links and is not added (no "yes" or "+" in front of your first page), therefore site indexing stops
_certain pages are listed when spidering but appear to contain no links and are not added (no "yes" or "+" in front of the page), so links from these pages are skipped
_pdfs, docs, xls, ... are listed when spidering but are not added (no "yes" or "+" in front of the document)

Problem
In my view, the problem is in the function phpdigMakeUTF8 in robot_functions.php (starting around line 1735).

This function takes the content of a page or (converted) file and tries to convert it to UTF-8. To do so, it runs a few checks to find out the current charset, and this is where the problem lies. For every document, HTML pages and files alike, it looks for a "content-type" meta tag inside the "head" tags. It should look like this:

Code:

<head>
...
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
...
</head>

If no such meta tag is found inside the "head" tags, the function returns no content at all to the spider. As a result, the document is not indexed or saved to the database, and of course no links in it can be found either. A PDF or other converted file will never contain a "head" section or any "meta" tags, so these files are always skipped entirely.
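
To illustrate why converted files can never pass this kind of check, here is a minimal sketch of a head/meta charset lookup. It is not PhpDig's actual code; the function name find_meta_charset and the regular expressions are my own assumptions, written only to mirror the behaviour described above.

```php
<?php
// Hypothetical helper, NOT taken from PhpDig: look for a charset declared
// in a Content-Type meta tag inside <head>...</head>.
function find_meta_charset($content) {
    // Only the head section is inspected, as described in the post.
    if (!preg_match('/<head[^>]*>(.*?)<\/head>/is', $content, $head)) {
        return null; // no head section at all, e.g. text extracted from a PDF
    }
    $pattern = '/<meta[^>]+http-equiv\s*=\s*["\']?Content-Type["\']?[^>]*'
             . 'charset\s*=\s*["\']?([a-z0-9_\-]+)/is';
    if (preg_match($pattern, $head[1], $m)) {
        return strtolower($m[1]);
    }
    return null; // head section present, but no charset declared
}

$html = '<html><head><meta http-equiv="Content-Type" '
      . 'content="text/html; charset=utf-8" /></head><body>hi</body></html>';
$pdf_text = "Plain text as produced by a PDF-to-text converter\n";

var_dump(find_meta_charset($html));     // string(5) "utf-8"
var_dump(find_meta_charset($pdf_text)); // NULL
```

Any caller that treats the NULL case as "discard the content" will drop every PDF, DOC, or XLS file, which matches the symptoms listed under Situation.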

Workaround
So if the meta tag is not found, the function should at least return the unconverted string instead of nothing. I haven't looked into it too much. Maybe the charset detection function should also be allowed to override the "not found" result of the meta tag lookup.

Note, however, that this workaround treats all pages and files without the meta tag described above as UTF-8 encoded. For PDFs, make sure you have the '-enc UTF-8' option of pdftotext enabled.

In robot_functions.php, at about line 1832, replace:
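
As a rough illustration of the pdftotext option, the sketch below builds a command line that asks pdftotext for UTF-8 output on stdout. The function name build_pdftotext_command and the example path are made up for this post; how your PhpDig configuration actually passes options to the external binary is not shown here, so check your own config.

```php
<?php
// Hypothetical helper, NOT part of PhpDig: build a pdftotext call that
// emits UTF-8 text on stdout ("-" as the output file means stdout).
function build_pdftotext_command($pdf_path, $binary = 'pdftotext') {
    return escapeshellcmd($binary)
         . ' -enc UTF-8 '
         . escapeshellarg($pdf_path)
         . ' -';
}

// Example path is an assumption; running the command requires pdftotext
// to be installed, e.g. $text = shell_exec($cmd);
$cmd = build_pdftotext_command('/tmp/example.pdf');
echo $cmd, "\n";
```

With the output already in UTF-8, returning the unconverted string from phpdigMakeUTF8 is safe for PDFs.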

Code:

  if ($no_convert == 0) {
    if (ENABLE_JPKANA == true) {
      $string = @mb_convert_kana($string,CONVERT_JPKANA,"UTF-8");
    }
    return $string;
  }
  else {
    return 0;
  }

with

Code:

  if ($no_convert == 0) {
    if (ENABLE_JPKANA == true) {
      $string = @mb_convert_kana($string,CONVERT_JPKANA,"UTF-8");
    }
    return $string;
  }
  else {
    return $string;
  }

The only change is replacing return 0; with return $string;.

Improvements
If someone has a patch that actually allows the charset detection to override the "unknown charset" result, please add it here. That would turn this workaround into a fix.
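
As a starting point, here is one possible direction, written as a standalone sketch and not as a tested PhpDig patch. The function name to_utf8_with_fallback and the ISO-8859-1 default are my assumptions; it needs the mbstring extension, which the existing mb_convert_kana call already relies on. The idea: trust a declared non-UTF-8 charset, otherwise keep the bytes if they are valid UTF-8, otherwise convert from an assumed fallback charset.

```php
<?php
// Hypothetical sketch, NOT a tested PhpDig patch. Requires mbstring.
// $meta_charset is the charset from the meta tag, or null if none was found.
// $fallback is an ASSUMED default for undeclared, non-UTF-8 content.
function to_utf8_with_fallback($string, $meta_charset = null, $fallback = 'ISO-8859-1') {
    // A declared charset other than UTF-8 is trusted and converted.
    // (An unknown charset name makes mb_convert_encoding fail; real code
    // would have to validate the name first.)
    if ($meta_charset !== null && strcasecmp($meta_charset, 'UTF-8') !== 0) {
        return mb_convert_encoding($string, 'UTF-8', $meta_charset);
    }
    // No usable declaration: keep the bytes if they are already valid UTF-8.
    if (mb_check_encoding($string, 'UTF-8')) {
        return $string;
    }
    // Otherwise convert from the assumed fallback charset.
    return mb_convert_encoding($string, 'UTF-8', $fallback);
}

// "caf\xE9" is "café" in ISO-8859-1 and is not valid UTF-8.
echo bin2hex(to_utf8_with_fallback("caf\xE9")), "\n";
```

Unlike the workaround, this never hands raw non-UTF-8 bytes to the indexer, but the fallback charset is still a guess for content that declares nothing.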

Greetings,
Olaf

