|
11-11-2003, 09:51 AM | #1 |
Orange Mole
Join Date: Nov 2003
Posts: 42
|
Version 1.6.3 and some bugs/ideas
A new member has joined!
Just installed the new version and some bugs/ideas came to my mind.
1. If a filename contains the & character it cannot be spidered (maybe also some other special ones). Solution is to change this in robot_functions: $path = $url['path']; to $path = ereg_replace('& amp;*','&',$url['path']); // edit: remove space between & and amp
2. In an M$ environment the is_executable function is not available until PHP 5.0.0. Comment those calls out and external binaries will start to work.
3. Antiword cannot handle long filenames, only DOS 8.3. Change $temp_filename to something like this: srand ((double)microtime()*1000000); $temp_filename = rand(0,999999).$suffix; and remember to also change $suffix. Question: the temp directory is not cleared totally after these mods - how to do this?
4. How to change %20 in file or folder names to a space? (I have a mod for this but it is so quick and dirty)
Otherwise I think this is a great piece of software! -Manfred |
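A minimal sketch of the fix from point 1, written for current PHP where the ereg_* functions no longer exist (they were removed in PHP 7). The function name normalize_path_ampersands is hypothetical and is not part of PhpDig; the pattern is the one quoted in the post, with the space removed.

```php
<?php
// Hypothetical helper, not part of PhpDig: undo HTML-escaped
// ampersands in the path component of a URL.
function normalize_path_ampersands(string $rawUrl): string
{
    $parts = parse_url($rawUrl);
    $path = isset($parts['path']) ? $parts['path'] : '';
    // preg_replace with the same pattern stands in for
    // ereg_replace('&amp;*', '&', ...) from the post.
    return preg_replace('/&amp;*/', '&', $path);
}

echo normalize_path_ampersands('http://example.com/docs/R&amp;D.html'), "\n";
```

Paths without an escaped ampersand pass through unchanged, so the call is safe to apply to every URL.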
11-11-2003, 11:49 PM | #2 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Thanks for the comments.
For points one through three, I've made the appropriate changes for version 1.6.4 of PhpDig. For point two, I've added an option in the config file whether to use the is_executable command. For point three, I've added an option in the config file to set the length of the temp filenames and set a check for uniqueness just in case. To your questions, are the files remaining in the temp directory empty? If so, in the robot_functions.php file find: PHP Code:
PHP Code:
PHP Code:
If the files remaining in the temp directory are not empty, what are the file extensions and what external binaries are you using? For point four, is it that space is changed to %20 in the search results, or where do you see this? If you could, please post your mod to give me a better idea of what you mean.
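The two config options described for points two and three could look roughly like the sketch below. The constant names USE_IS_EXECUTABLE and TEMP_NAME_LENGTH and both function names are assumptions made for illustration; the actual 1.6.4 option names are not given in this thread.

```php
<?php
// Hypothetical settings; the real PhpDig option names are not shown here.
const USE_IS_EXECUTABLE = true; // set to false where is_executable is unavailable
const TEMP_NAME_LENGTH = 8;     // 8 digits plus a 3-letter suffix fits DOS 8.3

// Decide whether an external binary may be called.
function binary_is_usable(string $binaryPath, bool $useIsExecutable = USE_IS_EXECUTABLE): bool
{
    if (!is_file($binaryPath)) {
        return false;
    }
    // Skip the is_executable test when it is switched off or missing.
    if (!$useIsExecutable || !function_exists('is_executable')) {
        return true;
    }
    return is_executable($binaryPath);
}

// Build a short numeric temp filename and retry until it is unused.
function unique_temp_filename(string $dir, string $suffix, int $length = TEMP_NAME_LENGTH): string
{
    do {
        $name = '';
        for ($i = 0; $i < $length; $i++) {
            $name .= mt_rand(0, 9);
        }
        $candidate = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $name . $suffix;
    } while (file_exists($candidate));
    return $candidate;
}

echo basename(unique_temp_filename(sys_get_temp_dir(), '.tmp')), "\n";
```

The loop only checks that the name is free at that moment; it does not reserve the file, so two crawlers running at once could still collide.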
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
11-12-2003, 07:58 AM | #3 |
Orange Mole
Join Date: Nov 2003
Posts: 42
|
Great support, thanks!
Sorry about that typo/copy&paste error. Yes, those temp files are empty. I'll implement your suggested patch and see what happens.
This space conversion thing is exactly what you said. In search results it would be nice to have all those names without %20s. This is just a minor thing but hey, why not be perfect if it is easy to correct!
Here is something I have used - this is not the right way to do it but it works. In robot_functions.php phpdigUpdSpiderRow insert these lines right after the function: $path=ereg_replace('%20*',' ',$path); $file=ereg_replace('%20*',' ',$file); and also replace this $titre_resume = $file; with $titre_resume = ereg_replace('%20*',' ',$file);
As you may guess this has some side effects. When spidering, in the first round an error will be seen in the Apache log, but in the second round it finds all folders and files with spaces. On the browser side this is not a problem because the browser converts all spaces back to %20s. Manfred |
11-12-2003, 10:42 AM | #4 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. I'm thinking that if the %20s are not wanted in the displayed search results, the search_function.php file could be modified so that the displayed text has no %20s while the links themselves keep the %20s. I think several browsers convert spaces back to %20, but aren't there browsers out there that don't do this? Maybe instead try the following:
In search_function.php, find: PHP Code:
PHP Code:
PHP Code:
PHP Code:
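The idea of decoding only the displayed text while leaving the link encoded can be sketched as follows. This is not the code from search_function.php, which is not reproduced in this thread; render_result_link is a hypothetical helper, and rawurldecode decodes every percent escape, not only %20.

```php
<?php
// Hypothetical helper: shows a readable label but keeps the href encoded.
function render_result_link(string $encodedUrl): string
{
    // Decode only the text the user sees.
    $label = rawurldecode($encodedUrl);
    // The href keeps its %20s, so browsers that do not re-encode
    // spaces still receive a valid URL.
    return '<a href="' . htmlspecialchars($encodedUrl, ENT_QUOTES) . '">'
         . htmlspecialchars($label, ENT_QUOTES) . '</a>';
}

echo render_result_link('http://example.com/my%20docs/annual%20report.pdf'), "\n";
```

Because the stored URL is never rewritten, the spider sees the same encoded path on every round and the Apache log stays free of the first-round errors mentioned earlier.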
11-12-2003, 01:15 PM | #5 |
Orange Mole
Join Date: Nov 2003
Posts: 42
|
Awesome response speed!!! Why doesn't commercial product support work like this?
Good news first! You are absolutely right about the compatibility issue, which cannot be compromised. Btw your patch works like a dream. Then some clarification about external binary usage in Windows. What I posted earlier was a cure for antiword, but pdftotext did not like it. It seems that numbers are not recognized in the filename extension at all. So I removed '.2' from all lines in the $command definitions. Then there is a part of the code that I don't understand at all. What is the meaning of PHP Code:
After commenting out those lines, pdf documents can also be spidered. And $suffix can be the default one. Maybe there are already threads about these issues but I couldn't find any. Hopefully this will help others to solve problems in Windows. |
11-12-2003, 02:22 PM | #6 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Edit: Note the below is for version 1.6.3 only.
Thanks, but my response speed is not always this fast. To answer your questions, let's assume the following: PHP Code:
This all works fine because catdoc is sending output to STDOUT, and it is this STDOUT output that is contained in the $result variable.
Now let's consider pdftotext. When doc.pdf is crawled abcdef.tmp is formed. Then abcdef.tmp is renamed to abcdef.tmp2 (rename($tempfile,$tempfile.'2');) and then doc.pdf is converted to text (exec($command,$result,$retval);), sticking the results in $result (an array) and returning $retval (success or failure). As before, on success, !$retval is true so $result gets some work done on it and is written back to abcdef.tmp and abcdef.tmp2.txt is returned from the last switch statement in the function. Again, the unlink($tempfile.'2'); deletes the abcdef.tmp2 file.
This doesn't work fine because pdftotext doesn't send output to STDOUT, but rather sends output to a file called abcdef.tmp2.txt leaving $result empty (note that 2.txt is the value of PHPDIG_PDF_EXTENSION). Hence, when $result is written back to abcdef.tmp, the abcdef.tmp file is empty. The reason for adding count($result) into the if statement is to prevent the writing of the empty file.
On other OS it should work the same, so if output is written to STDOUT, then the following can be left empty: PHP Code:
PHP Code:
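The count($result) guard described above can be sketched in isolation. This is an illustration under stated assumptions, not PhpDig's robot_functions.php: capture_to_tempfile is a hypothetical name, and the PHP binary itself stands in for catdoc (writes to STDOUT) and pdftotext (writes nothing to STDOUT) so that the example runs without either tool installed.

```php
<?php
// Hypothetical sketch: write converter output back to the temp file
// only when the command succeeded AND printed something to STDOUT.
function capture_to_tempfile(string $command, string $tempfile): bool
{
    $result = [];
    $retval = 1;
    exec($command, $result, $retval);
    // $retval is 0 on success. count($result) is 0 when the tool wrote
    // its text to a file of its own instead of STDOUT.
    if ($retval !== 0 || count($result) === 0) {
        return false; // leave the temp file untouched
    }
    return file_put_contents($tempfile, implode("\n", $result)) !== false;
}

// Stand-ins for the external binaries (assumption: the PHP CLI is available).
$php = escapeshellarg(PHP_BINARY);
$tmp = tempnam(sys_get_temp_dir(), 'dig');
var_dump(capture_to_tempfile($php . ' -r ' . escapeshellarg('echo 123;'), $tmp));
var_dump(capture_to_tempfile($php . ' -r ' . escapeshellarg('exit(0);'), $tmp));
unlink($tmp);
```

The second call models the pdftotext case: the exit status reports success, but with nothing on STDOUT the guard skips the write, so the temp file is not replaced by an empty one.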
11-17-2003, 12:06 PM | #7 |
Orange Mole
Join Date: Nov 2003
Posts: 42
|
Version 1.6.4 solved all problems mentioned above.
Great work Charter! |