PhpDig.net

08-24-2004, 01:15 AM   #1
pavel
Green Mole
 
Join Date: Aug 2004
Posts: 3
Some fixes for phpDigCleanHtml()

I was confused by the results of indexing one site. I looked into phpDigCleanHtml() and saw that the regular expressions used to find tags are not robust enough. Take a look:

//extracts title
if ( eregi("<title *>([^<>]*)</title *>",$text,$regs) ) {

If the page title is stored as <TITLE>Title of my site</TITLE>, this code does not work.
A more robust version is:

preg_match("/< *title *>(.*?)< */ *title *>/i",$text,$regs)

The same applies to this code:

//delete content of head, script, and style tags
$text = eregi_replace("<head[^>]*>.*</head>"," ",$text);
//$text = eregi_replace("<script[^>]*>.*</script>"," ",$text); // more conservative
$text = preg_replace("/<script[^>]*?>.*?<\/script>/is","",$text); // less conservative
$text = eregi_replace("<style[^>]*>.*</style>"," ",$text);
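A minimal sketch of why the non-greedy form matters for the <style> line as well, not the actual phpDig code. The sample HTML is made up, and preg_replace() is used instead of eregi_replace() because the ereg functions were removed in PHP 7:

```php
<?php
// Hypothetical sample page: two <style> blocks with visible text between them.
$html = '<style>a{}</style>Visible text<style>b{}</style>';

// Greedy: .* runs to the LAST </style>, so "Visible text" is removed as well.
$greedy = preg_replace('/<style[^>]*>.*<\/style>/is', ' ', $html);

// Lazy: .*? stops at the FIRST </style>, so each block is removed on its own.
$lazy = preg_replace('/<style[^>]*>.*?<\/style>/is', ' ', $html);

var_dump($greedy); // string(1) " "
var_dump($lazy);   // string(14) " Visible text "
```

With the greedy pattern, any text between two style blocks is lost to the index; with the lazy one it survives.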

I think it would be better to replace every tag with a space. For example, replace

$text = ereg_replace("[[:space:]]+"," ",eregi_replace("<[^>]*>","",$text));

with

$text = ereg_replace("[[:space:]]+"," ",eregi_replace("<[^>]*>"," ",$text));

(Now <td>Hello</td><td>Pavel</td> will be indexed as two words, "Hello Pavel", instead of "HelloPavel".)
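The difference can be checked with a short sketch. It assumes preg_replace() in place of the ereg functions (removed in PHP 7), and the trim() call is an addition for illustration, not part of phpDigCleanHtml():

```php
<?php
// Sample row from the post: two adjacent table cells.
$html = '<td>Hello</td><td>Pavel</td>';

// Tags removed outright: the two words run together.
$joined = preg_replace('/\s+/', ' ', preg_replace('/<[^>]*>/', '', $html));

// Tags replaced by a space: the words stay separate.
// trim() (an addition here) drops the leading and trailing space.
$separated = trim(preg_replace('/\s+/', ' ', preg_replace('/<[^>]*>/', ' ', $html)));

var_dump($joined);    // string(10) "HelloPavel"
var_dump($separated); // string(11) "Hello Pavel"
```

The extra spaces produced by the second variant are collapsed by the whitespace replacement, so the only visible change is the word boundary.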

PS: Sorry for my English.
08-24-2004, 01:19 AM   #2
pavel
Green Mole
 
Join Date: Aug 2004
Posts: 3
There was an error

read

preg_match("/< *title *>(.*?)< */ *title *>/i",$text,$regs)

as

preg_match('/< *title *>(.*?)< *\/ *title *>/i',$text,$regs)
08-24-2004, 01:46 AM   #3
pavel
Green Mole
 
Join Date: Aug 2004
Posts: 3
Bad example

I gave a bad example, sorry. Try to dig HTML with this title:
<TITLE>HOME > NEWS</TITLE>
I know that > should be written as &gt;, but not all webmasters know this.
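A small sketch comparing the two patterns on this title. The first pattern is a preg_match() rendering of the original eregi() one (an assumption, since the ereg functions were removed in PHP 7); the second is the corrected pattern from post #2, with the "s" modifier and trim() added here for illustration. The surrounding HTML is made up:

```php
<?php
// Title from the post above, with an unescaped ">" inside it.
$text = '<html><head><TITLE>HOME > NEWS</TITLE></head><body>body text</body></html>';

// Pattern shaped like the original eregi() one: [^<>]* cannot cross the ">".
$strict = preg_match('/<title *>([^<>]*)<\/title *>/i', $text, $regs);

// Corrected pattern from post #2, plus the "s" modifier (an addition here)
// so that titles spanning several lines also match.
$loose = preg_match('/< *title *>(.*?)< *\/ *title *>/is', $text, $m);
$title = $loose ? trim($m[1]) : '';

var_dump($strict); // int(0)
var_dump($title);  // string(11) "HOME > NEWS"
```

The character class [^<>]* is what makes the original pattern fail: it stops at the first ">" inside the title, so the closing tag is never reached.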