|
04-09-2004, 12:15 PM | #1 |
Green Mole
Join Date: Apr 2004
Posts: 4
|
phpdigCleanHtml cleans too much
With these two lines in the function phpdigCleanHtml from robot_functions.php:
$text = eregi_replace("<script[^>]*>.*</script>"," ",$text);
$text = eregi_replace("<style[^>]*>.*</style>"," ",$text);
if we have, for example:
<script> fdlsm </script> important information <script> fdlsm </script>
the text "important information" will be erased, because the ereg function matches from the first <script> to the last </script>.
The correction may be:
$text = eregi_replace("<script[^>]*>([^<]+)?</script>","",$text);
$text = eregi_replace("<style[^>]*>([^<]+)?</style>","",$text); |
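The behaviour described in this post can be reproduced with a short sketch. It assumes preg_replace() with the "i" flag as a stand-in for eregi_replace(), since the ereg functions were removed in PHP 7.0; the variable names $greedy and $proposed are made up for the illustration.

```php
<?php
// Sample input from the post: text sandwiched between two script elements.
$text = "<script> fdlsm </script> important information <script> fdlsm </script>";

// Original pattern (PCRE stand-in for eregi_replace): the greedy ".*"
// runs from the first <script> to the last </script>.
$greedy = preg_replace('/<script[^>]*>.*<\/script>/i', ' ', $text);

// Pattern proposed in the post: the element body may not contain "<".
$proposed = preg_replace('/<script[^>]*>([^<]+)?<\/script>/i', '', $text);

echo "[", $greedy, "]\n";   // the whole line has been collapsed to one space
echo "[", $proposed, "]\n"; // "important information" survives
```

One caveat with the proposed pattern: `[^<]+` stops at any `<` inside the script body, so a script containing a comparison such as `a < b` would not be matched and would be left in the text.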
04-11-2004, 02:55 PM | #2 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. In the config file, perhaps try lowering define('CHUNK_SIZE',2048); to a number small enough that the 'important information' being cleaned no longer falls between a first and a last tag like those posted.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
04-11-2004, 05:18 PM | #3 |
Green Mole
Join Date: Apr 2004
Posts: 4
|
I think my explanation was bad, so here is an example. With this HTML page:
******************
<script></script> some text, html code, a usually page in html ... <script></script> another part for text ... <script></script>
******************
this operation:
eregi("<script[^>]*>(.*)</script>",$txt,$regs);
will fill the variable $regs with:
***********************
print_r($regs):
array(
[0] => <script></script> some text, html code, a usually page in html ... <script></script> another part for text ... <script></script>
[1] => </script> some text, html code, a usually page in html ... <script></script> another part for text ... <script>
)
**********************
As you can see, $regs[1] contains all the HTML code! So the clean function will cut the whole page if it contains a script tag at the beginning and at the end! And JavaScript is pretty popular! |
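The capture shown in this post can be checked with a minimal sketch. It assumes preg_match() as a stand-in for eregi() (removed in PHP 7.0) and uses a shortened version of the sample page.

```php
<?php
// Shortened version of the page from the post: three empty script elements
// with ordinary text between them.
$txt = "<script></script> some text <script></script> another part <script></script>";

// Same pattern as in the post, written in PCRE syntax with the "i" flag.
preg_match('/<script[^>]*>(.*)<\/script>/i', $txt, $regs);

// $regs[0] is the full match, $regs[1] is what the greedy (.*) captured:
// everything between the first <script> and the last </script>.
print_r($regs);
```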
04-11-2004, 08:40 PM | #4 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. I understood.
Another approach would be to change define('CHUNK_SIZE',2048); to something small like define('CHUNK_SIZE',20); in the config file and then index. The chunk size is basically the string length of a chunk of text sent to the phpdigCleanHtml function. A smaller chunk size should pick up text between tags but may increase index time. Of course, TMTOWTDI.
|
04-12-2004, 02:40 AM | #5 |
Green Mole
Join Date: Apr 2004
Posts: 4
|
Ah, OK, I understand.
I thought the whole page was taken with a big chunk size. |
04-12-2004, 03:57 AM | #6 |
Green Mole
Join Date: Apr 2004
Posts: 4
|
A better correction would be:
$txt = preg_replace("/<TAG[^>]*>(.*?)<\/TAG>/is"," ",$txt);
The '?' makes the quantifier lazy, so the match stops at the first closing tag. Moreover, preg functions are generally faster than ereg functions.
Last edited by Jer; 04-12-2004 at 04:00 AM. |
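The lazy-quantifier approach from this post can be sketched as a small helper. The function name stripElement() is hypothetical and not part of phpDig; the single-space replacement is an assumption carried over from the original robot_functions.php lines quoted in the first post.

```php
<?php
// Hypothetical helper, not part of phpDig: removes every <tag>...</tag>
// element, including its body, and leaves a single space in its place.
function stripElement(string $html, string $tag): string
{
    $t = preg_quote($tag, '/');
    // ".*?" is lazy, so each match ends at the nearest closing tag.
    // "i" ignores case; "s" lets "." match newlines inside the element.
    return preg_replace('/<' . $t . '\b[^>]*>.*?<\/' . $t . '>/is', ' ', $html);
}

// A script body containing "<" and newlines, plus an upper-case style element.
$page  = "<script>\nif (a < b) {}\n</script> important information "
       . "<STYLE>p{}</STYLE> more text";
$clean = stripElement(stripElement($page, 'script'), 'style');

// Collapse the leftover whitespace for display.
echo preg_replace('/\s+/', ' ', trim($clean)), "\n";
```

Unlike the `[^<]+` variant, the lazy pattern also handles script bodies that contain `<`, because it only looks for the nearest closing tag.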