PhpDig.net

08-24-2006, 06:30 AM   #1
obottek
Green Mole
 
Join Date: Sep 2003
Posts: 15
Certain sites, pages and pdfs are not indexed 1.8.9 RC1 [Workaround included]

In my vanilla installation of 1.8.9 RC1, certain sites and pages are not indexed, and pdfs or any other files handled through the external binaries are not indexed at all.

Situation
You might see one or more of the following:
_the site you submit for spidering appears to contain no links and is not added (no "yes" or "+" in front of your first page), therefore site indexing stops
_certain pages are listed when spidering but appear to contain no links and are not added (no "yes" or "+" in front of the page), so links from these pages are skipped
_pdfs, docs, xls, ... are listed when spidering but are not added (no "yes" or "+" in front of the document)

Problem
In my view, the problem is in the function phpdigMakeUTF8 in robot_functions.php (starting around line 1735).

This function takes the content of a page or (converted) file and tries to convert it to UTF-8. To do so, it runs a few checks to determine the current charset, and this is where the problem lies. For all documents, html pages and files alike, it looks for a "content-type" meta tag inside the "head" tags. It should look like this:

Code:
<head>
...
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
...
</head>
If no such meta tag is found inside the "head" tags, the function returns no content at all to the spider. The document is therefore neither indexed nor saved to the database, and of course no links in it can be found. Since a pdf or other file never contains a "head" area or any "meta" tags, such files are always skipped.
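
To illustrate the kind of check described above, here is a minimal sketch of a meta tag charset lookup. The function name findMetaCharset and the regular expressions are my own assumptions for illustration, not the actual PhpDig code. It returns null when no charset is declared, which is always the case for text extracted from a pdf.

```php
<?php
// Hypothetical helper, not part of PhpDig: looks for a charset declared
// in a meta tag inside the <head> section of a document.
function findMetaCharset(string $content): ?string
{
    // Isolate the <head>...</head> section; plain text from a converter has none.
    if (!preg_match('/<head\b[^>]*>(.*?)<\/head>/is', $content, $head)) {
        return null;
    }
    // Look for charset=... inside a meta tag within the head section.
    if (preg_match('/<meta\b[^>]*charset\s*=\s*["\']?([a-z0-9_\-]+)/i', $head[1], $m)) {
        return strtolower($m[1]);
    }
    return null;
}

$html = '<html><head><meta http-equiv="Content-Type" '
      . 'content="text/html; charset=utf-8" /></head><body>x</body></html>';

$charset = findMetaCharset($html);                              // 'utf-8'
$none    = findMetaCharset('Plain text extracted from a pdf');  // null
```

A caller that treats the null case as "no content" reproduces the symptom from this report. Treating it as "charset unknown" and passing the text on avoids it.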

Workaround
So if the meta tag is not found, the function should at least return the unconverted string instead of nothing. I haven't looked into it too much. Maybe the charset detection function should also be allowed to override the result when the meta tag is not found.

Note, however, that this workaround treats all pages and files without the above-named meta tag as UTF-8 encoded. For pdfs, make sure you have the '-enc UTF-8' option of pdftotext enabled.
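
As a rough sketch of what the pdftotext call looks like with that option, the snippet below builds the command line. The helper name buildPdfToTextCommand and the example path are hypothetical, and this is not how PhpDig itself assembles the call. Only the '-enc UTF-8' option and "-" as the output file (standard output) are pdftotext features.

```php
<?php
// Hypothetical helper, not part of PhpDig: builds a pdftotext command line
// that writes UTF-8 text to standard output ("-" as the output file).
function buildPdfToTextCommand(string $pdfPath, string $binary = 'pdftotext'): string
{
    return sprintf(
        '%s -enc UTF-8 %s -',
        escapeshellcmd($binary),   // guard the binary name
        escapeshellarg($pdfPath)   // quote the file path for the shell
    );
}

$cmd = buildPdfToTextCommand('/tmp/manual.pdf');
// $text = shell_exec($cmd); // run only where pdftotext is installed
```

With the output forced to UTF-8 here, treating the converted text as UTF-8 in the workaround is a safe assumption.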

In robot_functions.php, around line 1832, replace:
Code:
  if ($no_convert == 0) {
    if (ENABLE_JPKANA == true) {
      $string = @mb_convert_kana($string,CONVERT_JPKANA,"UTF-8");
    }
    return $string;
  }
  else {
    return 0;
  }
with
Code:
  if ($no_convert == 0) {
    if (ENABLE_JPKANA == true) {
      $string = @mb_convert_kana($string,CONVERT_JPKANA,"UTF-8");
    }
    return $string;
  }
  else {
    return $string; // changed from: return 0;
  }
The only change is replacing return 0; with return $string; in the else branch.

Improvements
If someone has a patch that actually allows the charset detection to override the "unknown charset", please add it here. This would turn this workaround into a fix.
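
One possible direction, purely as an untested sketch and not a patch against phpdigMakeUTF8: when no meta tag is found, try a detection over a short list of candidate charsets and convert if something other than UTF-8 is detected. The function name toUtf8WithDetection and the candidate list are my assumptions, and it requires the mbstring extension.

```php
<?php
// Hypothetical fallback, not part of PhpDig: intended for content where
// no charset meta tag was found. Requires the mbstring extension.
function toUtf8WithDetection(string $string, array $candidates = ['UTF-8', 'ISO-8859-1']): string
{
    // Strict mode rejects candidates in which the bytes are invalid.
    $detected = mb_detect_encoding($string, $candidates, true);
    if ($detected === false || $detected === 'UTF-8') {
        // Unknown or already UTF-8: return the string unchanged,
        // which is what the workaround does.
        return $string;
    }
    return mb_convert_encoding($string, 'UTF-8', $detected);
}

$latin1 = "caf\xE9";                     // "café" encoded as ISO-8859-1
$utf8   = toUtf8WithDetection($latin1);  // same word, now UTF-8 bytes
```

Be aware that ISO-8859-1 accepts any byte sequence, so with this candidate list everything that is not valid UTF-8 is treated as ISO-8859-1. For sites in other charsets the list would have to be adjusted.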

Greetings,
Olaf