|
08-24-2004, 01:15 AM | #1 |
Green Mole
Join Date: Aug 2004
Posts: 3
|
Some fixes for phpDigCleanHtml()
I was confused by the results of indexing one site. I looked into phpDigCleanHtml() and saw that the regexps used for finding tags are not powerful enough. Take a look:
//extracts title
if ( eregi("<title *>([^<>]*)</title *>",$text,$regs) ) {
If the page title is stored as <TITLE>Title of my site</TITLE>, this code does not work. A more powerful version is:
preg_match("/< *title *>(.*?)< */ *title *>/i",$text,$regs)
The same goes for this code:
//delete content of head, script, and style tags
$text = eregi_replace("<head[^>]*>.*</head>"," ",$text);
//$text = eregi_replace("<script[^>]*>.*</script>"," ",$text); // more conservative
$text = preg_replace("/<script[^>]*?>.*?<\/script>/is","",$text); // less conservative
$text = eregi_replace("<style[^>]*>.*</style>"," ",$text);
I think it would be better to replace every tag with a space. For example, change
$text = ereg_replace("[[:space:]]+"," ",eregi_replace("<[^>]*>","",$text));
to
$text = ereg_replace("[[:space:]]+"," ",eregi_replace("<[^>]*>"," ",$text));
(now <td>Hello</td><td>Pavel</td> will be indexed correctly)
PS: Sorry for my English. |
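The two changes suggested in this post can be tried in isolation. The following is only a rough sketch, not phpDig's actual code: it assumes preg_replace as a stand-in for the ereg functions quoted above, and the sample strings in $page and $cells are made up for illustration.

```php
<?php
// Sketch only: preg_replace stands in for the ereg/eregi calls quoted above.

// 1. Greedy vs. lazy matching when deleting script blocks.
//    Hypothetical page fragment with text between two script blocks.
$page = "<script>a()</script>Visible text<script>b()</script>";

// Greedy ".*" runs from the first <script> to the last </script>,
// so the text between the two blocks is deleted as well.
$greedy = preg_replace('/<script[^>]*>.*<\/script>/is', '', $page);

// Lazy ".*?" stops at the nearest </script>, so the text survives.
$lazy = preg_replace('/<script[^>]*?>.*?<\/script>/is', '', $page);

echo "[", $greedy, "]\n"; // []
echo "[", $lazy, "]\n";   // [Visible text]

// 2. Replacing tags with "" vs. with " ".
$cells = "<td>Hello</td><td>Pavel</td>";

// Empty replacement: the contents of adjacent cells run together.
$joined = preg_replace('/\s+/', ' ', preg_replace('/<[^>]*>/', '', $cells));

// Space replacement: each cell stays a separate word once the
// whitespace is collapsed and trimmed.
$spaced = trim(preg_replace('/\s+/', ' ', preg_replace('/<[^>]*>/', ' ', $cells)));

echo $joined, "\n"; // HelloPavel
echo $spaced, "\n"; // Hello Pavel
```

With the empty replacement the indexer would see the single word "HelloPavel", while the space replacement leaves two separate words.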
08-24-2004, 01:19 AM | #2 |
Green Mole
Join Date: Aug 2004
Posts: 3
|
There was an error
Please read
preg_match("/< *title *>(.*?)< */ *title *>/i",$text,$regs)
as
preg_match('/< *title *>(.*?)< *\/ *title *>/i',$text,$regs)
(the slash of the closing tag has to be escaped, because / is also the pattern delimiter) |
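As a side note, the escaping problem can also be avoided by picking a different delimiter, since PCRE patterns in PHP accept other delimiter characters such as #. A minimal sketch, assuming a made-up input string in $text:

```php
<?php
// Hypothetical input; only the title element matters here.
$text = "<html><head><TITLE>Title of my site</TITLE></head></html>";

// With "#" as the delimiter, the "/" of the closing tag needs no backslash.
$found = preg_match('#< *title *>(.*?)< */ *title *>#i', $text, $regs);

if ($found) {
    echo $regs[1], "\n"; // Title of my site
}
```

Note that without the s modifier the dot does not match newlines, so a title spread over several lines would not be caught by this pattern.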
08-24-2004, 01:46 AM | #3 |
Green Mole
Join Date: Aug 2004
Posts: 3
|
Bad example
I gave a bad example, sorry. Try digging HTML with this title:
<TITLE>HOME > NEWS</TITLE>
I know that > must be written as &gt;, but not all webmasters know this. |
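To see the difference on this title, here is a small sketch. It assumes the original eregi pattern ported to preg_match with the i modifier, which is an approximation of the old behavior and not the phpDig source; $page is an invented test page.

```php
<?php
// Hypothetical page containing a title with an unescaped ">".
$page = "<html><head><TITLE>HOME > NEWS</TITLE></head></html>";

// Original-style pattern: the title text may contain neither "<" nor ">",
// so the bare ">" inside the title prevents any match.
$strict = preg_match('/<title *>([^<>]*)<\/title *>/i', $page, $a);

// Pattern proposed in this thread: lazy match up to the closing tag.
$loose = preg_match('/< *title *>(.*?)< *\/ *title *>/i', $page, $b);

echo $strict, "\n"; // 0
echo $b[1], "\n";   // HOME > NEWS
```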