|
01-09-2005, 07:55 AM | #1 |
Green Mole
Join Date: Jan 2005
Posts: 3
|
urls with collection of weird characters
Hi,
phpdig is having trouble indexing large parts of my intranet. I think I have tracked the problem down to the following: a large number of URLs use a number of special characters, and I think they are just not being picked up by phpdig. The following is missed, for example: http://my.intranet.com/WBSITE/INTRAN...489784,00.html These are static URLs, not dynamic URLs. Where can I change the regular expression, and how, so that these pages get indexed? Bert |
01-09-2005, 08:20 AM | #2 |
Purple Mole
Join Date: Jan 2004
Posts: 694
|
Welcome to the forum, revenazb.
I suspect you need to change the value of the variable called PHPDIG_ENCODING in the config file to match the character set in your intranet. |
01-09-2005, 09:11 AM | #3 |
Green Mole
Join Date: Jan 2005
Posts: 3
|
Hi,
Thanks for the welcome. It is not so much the character set, as the site is in English. The problem is with the URL of the page: it contains a lot of commas, colons, tildes, etc. Looking at the regular expression that I think phpdig uses, it looks like it would not pick up the URL I put above. While the individual characters are in there, I don't think the pattern would be picked up. I am not very good with regular expressions. I'm putting the line of code that I think is relevant below.
Regular expression:
Code:
"(<frame[^>]*src[[:blank:]]*=|href[[:blank:]]*=|HREF[[:blank:]]*=|http-equiv=['\"]refresh['\"] *content=['\"][0-9]+;[[:blank:]]*url[[:blank:]]*=|window[.]location[[:blank:]]*=|window[.]open[[:blank:]]*[(])[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9\|+ ()~-]*))(#[.a-zA-Z0-9-]*)?[\'\" ]?"
Code:
my.intranet.com/WBSITE/INTRANET/UNITS/INTINFNETWORK/0,,contentMDK:20295425~pagePK:64156298~piPK:64152276~theSitePK:489784,00.html
Last edited by revenazb; 01-09-2005 at 09:47 AM. Reason: Small Typo |
01-09-2005, 09:36 AM | #4 |
Purple Mole
Join Date: Jan 2004
Posts: 694
|
I don't even know how to read a URL like that, let alone modify the regular expression so that phpdig would index it. If nobody else comes along to help you modify it, I know of another great forum (here) where you could probably get some help with that regular expression.
|
01-09-2005, 10:15 AM | #5 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
If the link were as follows:
Code:
http://www.domain.com/dir/0,,contentMDK:20295425~pagePK:64156298~piPK:64152276~theSitePK:489784,00.html
Code:
127.0.0.1 - - [09/Jan/2005:10:00:30 -0800] "HEAD /20295425~pagePK:64156298~piPK:64152276~theSitePK:489784,00.html HTTP/1.1" 404 0 "-" "PhpDig/1.8.6 (+http://www.phpdig.net/robot.php)"
There are actually two spots in robot_functions.php to edit:
- One
Code:
while (eregi("(<frame[^>]*src[[:blank:]]*=|href[[:blank:]]*=|http-equiv=['\"]refresh['\"] *content=['\"][0-9]+;[[:blank:]]*url[[:blank:]]*=|window[.]location[[:blank:]]*=|window[.]open[[:blank:]]*[(])[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9\|+ ()~-]*))(#[.a-zA-Z0-9-]*)?[\'\" ]?",$eval,$regs)) {
Code:
while(eregi("<a([^>]*href[[:blank:]]*=[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9 ()~-]*))[#\'\" ]?)",$line,$regs)) {
Oh, and vB inserts a space if there are too many characters in a row without a space, so take that into account when considering the code posted herein.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
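A quick way to check whether the path character class from the first pattern in this thread accepts the problem URL is sketched below. It is only an illustration, not PhpDig code: the class is rewritten for `preg_match` (PCRE) because `eregi` is no longer available in PHP 7 and later, the variable names are made up, and the space inside the class is assumed to be part of the original pattern rather than one inserted by vB.

```php
<?php
// Path/query character class copied from the first pattern in this thread,
// rewritten for PCRE and anchored so the whole string must match.
// Assumption: the space inside the class belongs to the original pattern.
$pathClass = '#^[:%/?=&;\\\\,._a-zA-Z0-9|+ ()~-]*$#';

// Path part of the URL quoted earlier in the thread.
$path = '/WBSITE/INTRANET/UNITS/INTINFNETWORK/0,,contentMDK:20295425'
      . '~pagePK:64156298~piPK:64152276~theSitePK:489784,00.html';

// preg_match returns 1 when the pattern matches, 0 when it does not.
$matches = preg_match($pathClass, $path) === 1;

var_dump($matches);
```

If the check reports a match, the commas, colons and tildes are all covered by the class, which points away from the pattern itself as the cause.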
01-09-2005, 07:16 PM | #6 |
Green Mole
Join Date: Jan 2005
Posts: 3
|
Thanks for the help,
I have identified the problem. It is not in the regular expressions above but in the function that rewrites URLs, where the following line
$url = @parse_url(str_replace('\'"','',$eval));
should be replaced with
list($url['path'], $url['query']) = split("\?", str_replace('\'"','',$eval));
What happens is that the parse_url function misinterprets the first colon in the path and therefore messes up. Using the split function fixes this problem. Hope this helps someone else. |
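For anyone on a newer PHP, where `split` no longer exists (it was removed in PHP 7.0), the same idea can be sketched with `explode`. This is a minimal illustration, not the PhpDig code: `split_path_query` is a hypothetical helper name, and the second sample link is made up.

```php
<?php
// Hypothetical helper, not part of PhpDig: separate a link into path and
// query by cutting at the first "?" only, without calling parse_url.
function split_path_query(string $link): array
{
    // Limit of 2 keeps any later "?" characters inside the query part.
    $parts = explode('?', $link, 2);

    return [
        'path'  => $parts[0],
        'query' => $parts[1] ?? '', // empty string when there is no "?"
    ];
}

// Relative link of the kind discussed in this thread (colons in the path).
$link = '0,,contentMDK:20295425~pagePK:64156298~piPK:64152276~theSitePK:489784,00.html';
$url  = split_path_query($link);

// Made-up link with a query string, for comparison.
$withQuery = split_path_query('dir/page.html?id=1&x=2');

echo $url['path'], "\n";
echo $withQuery['path'], ' | ', $withQuery['query'], "\n";
```

One difference from the `split("\?", ...)` line: with a limit of 2, a link containing more than one question mark keeps everything after the first one in the query instead of dropping the later pieces.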
01-10-2005, 02:09 AM | #7 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Yes, I see what you say about parse_url with those types of links.
If you plan to use:
Code:
list($url['path'], $url['query']) = split("\?", str_replace('\'"','',$eval));
Code:
$url = @parse_url(str_replace('\'"','',$eval));
Code:
if (!eregi("[?]",$eval)) { $eval .= "?"; }
Code:
list($url['path'], $url['query']) = split("\?", str_replace('\'"','',$eval));
Note: it's not enough to comment out the "remove ending question mark" line, as phpdigRewriteUrl is called in various places with various content.
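Read together, the snippets above appear to add a guard that appends a "?" when the link has none, so that the `list(...)` assignment always receives two pieces. A minimal sketch of that guard for newer PHP follows. It assumes that reading is correct, uses `strpos` and `explode` in place of the removed `eregi` and `split`, and the sample value of `$eval` is made up.

```php
<?php
// Made-up sample value: a link with no query string.
$eval = 'dir/0,,contentMDK:20295425,00.html';

// Same test as eregi("[?]", $eval): is there a question mark anywhere?
if (strpos($eval, '?') === false) {
    $eval .= '?'; // guarantee a separator so two pieces always exist
}

// Cut at the first "?" only; the query part is empty for this sample.
list($path, $query) = explode('?', $eval, 2);

var_dump($path, $query);
```

Without the guard, a link that has no "?" yields a single piece and the second variable in `list(...)` is left unset.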