|
10-03-2003, 05:56 AM | #1 |
Green Mole
Join Date: Oct 2003
Location: Netherlands
Posts: 5
|
Index update question
Hi, I recently found PhpDig while searching for a good site-search engine for my remotely hosted website, and I am currently configuring it to suit my needs. I still have a couple of questions, and I hope someone here can help me along:
- A full index takes hours, and besides spidering and indexing all links, it also finds and indexes all the files in the subdirectories. I just want it to spider the links, because the files themselves are not complete: served through the website, they are embedded in PHP- and CSS-driven templates in order to produce complete HTML pages. Is there any way to tell PhpDig not to find and index loose files, but just to spider and index the links in HTML pages?
- What happens when I embed links between the <!-- phpdigExclude --> and <!-- phpdigInclude --> tags? Are hyperlinks placed between those tags ignored for spidering? I hope so! (See the snippet below for how I am using them.)
- Right now, PhpDig also indexes words and picks up links that are embedded in HTML comment tags (<!-- and -->). Too bad.
- Wouldn't it be an idea if you could configure PhpDig with a list of files and directories to ignore? Then the spider would not have to crawl everything only to find out from the META ROBOTS tag that certain pages are not to be indexed.
- Is there a possibility to add separate files to the index through the web interface? I have a news service on my site that is driven by a single PHP file. Right now it looks like adding new files to the index means spidering the entire news directory, which makes PhpDig spider 900+ pages now, over 1200 next year, etc.
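For reference, this is roughly how I have marked up a navigation block with those tags (a simplified sketch; the link targets are made-up examples from my site):
HTML Code:
<!-- phpdigExclude -->
<div class="menu">
  <!-- everything in this block should be skipped by PhpDig -->
  <a href="show.php?link=a">News item A</a>
  <a href="show.php?link=b">News item B</a>
</div>
<!-- phpdigInclude -->
My question is whether the hyperlinks inside such a block are skipped for spidering too, or whether only the text between the tags is excluded from indexing.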
__________________
-- Life is wasted on the living |
10-03-2003, 07:50 AM | #2 | |||||
Purple Mole
Join Date: Sep 2003
Location: Kassel, Germany
Posts: 119
|
Re: Index update question
Quote:
__________________
-Roland- :: Test PhpDig 1.6.2 here :: - :: Test-Search for (little) Intelligent Php-Dig Fuzzy :: |
10-03-2003, 09:37 AM | #3 | |
Green Mole
Join Date: Oct 2003
Location: Netherlands
Posts: 5
|
Re: Re: Index update question
Roland, thank you for your advice! I think this might just be the trick to speed up the spidering process and avoid having to remove hundreds of files by hand every time.
I have one question about your answer to my first question, though: Quote:
In other words: I just want PhpDig to index the URL .../show.php?link=a (which incorporates a.htm), but I do not want PhpDig to index the a.htm file itself, as it is not a web page but just part of one. Your suggestion to put phpdigExclude and phpdigInclude tags into a.htm would not work, because then the contents would also not be indexed when the spider indexes show.php?link=a! If PhpDig spiders the site from the root URL, it should never encounter a.htm, just show.php?link=a. But it does encounter it: it not only spiders the links and indexes the pages found that way, it also reads the remote filesystem and indexes every single file it finds. And that is not what I want it to do.
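To make the setup concrete, here is a minimal sketch of the wrapper script (simplified; the file names and template includes are examples, not my real code):
PHP Code:
<?php
// show.php - wraps a bare HTML snippet in the full site template.
// basename() keeps the parameter from escaping the news directory.
$link = isset($_GET['link']) ? basename($_GET['link']) : 'home';
include 'header.php';                // template: header, navigation, CSS
readfile('news/' . $link . '.htm');  // the bare snippet (e.g. a.htm)
include 'footer.php';
?>
So a.htm is only ever fetched via show.php?link=a; no page links to a.htm directly.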
__________________
-- Life is wasted on the living |
|
10-05-2003, 10:56 AM | #4 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Not sure if I understand completely, but you might try setting the following in the config file:
PHP Code:
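// A sketch of the likely settings (assumed, not confirmed): these are
// PhpDig's spider-depth constants in config.php; the values shown here
// are only examples to experiment with.
define('SPIDER_MAX_LIMIT',3);      // maximum depth the spider may crawl
define('SPIDER_DEFAULT_LIMIT',3);  // default crawl depth
define('RESPIDER_LIMIT',4);        // depth used when respidering a site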
__________________
Responses are offered on a voluntary, as-time-allows basis, with no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email. Thank you for your understanding. |
10-06-2003, 11:41 AM | #5 |
Green Mole
Join Date: Oct 2003
Location: Netherlands
Posts: 5
|
I tried this, but unfortunately without the result I hoped for. Setting the spider depth to 1 causes only my index page and the links found there to be spidered, which covers only about 5% of the entire site; the rest of the links sit in the next two levels of the site's link tree.
Increasing the spider depth to 2 allowed more of my site to be spidered, and more importantly, none of the files I mentioned in the original posting were found. But a number of sub-pages were still not being spidered. Increasing the spider depth one step further, to 3, results in the entire site being spidered, but also in all the files in the involved subdirectories being indexed, even though they are not part of the link tree. Could this be a bug in PhpDig?
__________________
-- Life is wasted on the living |
10-06-2003, 04:00 PM | #6 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. PhpDig is set to crawl any links it encounters at the given level. I'm not sure if "called by the php script" means that the PHP script is feeding the a.htm files via a.htm links. Does setting up a robots.txt file in the web root, so that PhpDig doesn't crawl a.htm-type files, work? Something like the sketch below, for example.
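A minimal sketch (the directory name is made up; use whichever paths hold the snippet files):
Code:
# robots.txt in the web root - PhpDig honors Disallow rules
User-agent: *
Disallow: /news/snippets/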
__________________
Responses are offered on a voluntary, as-time-allows basis, with no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email. Thank you for your understanding. |
10-06-2003, 11:27 PM | #7 | |
Green Mole
Join Date: Oct 2003
Location: Netherlands
Posts: 5
|
Quote:
I already excluded the major part of these files by disallowing their directories in robots.txt, wherever the PHP script is not located in the same directory as the *.htm snippets. This solves about 60% of the problem. But disallowing the remaining files by naming them one by one in robots.txt (see below) is still a lot of work, and it is a workaround for a problem that should not exist in the first place. I still find it curious that files are found when no HTML link points to them, if PhpDig only spiders HTML links. Previously I used search services like Atomz and FreeFind (they became too limited for my rapidly expanding site), and they just spidered the links and nothing else.
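This is what the per-file workaround looks like (the file names are made-up examples):
Code:
User-agent: *
Disallow: /news/a.htm
Disallow: /news/b.htm
Disallow: /news/c.htm
# ...and so on, one line per snippet file
With hundreds of snippets sitting in the same directory as the script, this does not scale.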
__________________
-- Life is wasted on the living |
|
10-08-2003, 11:37 AM | #8 |
Green Mole
Join Date: Oct 2003
Location: Mesa, AZ
Posts: 15
|
I'm no phpDig expert (yet!), but I find it highly unlikely that phpDig is reading the remote filesystem. It should ONLY be able to find files (.html, .php, or whatever) that are explicitly linked from a visible page. That could include a subdirectory that has no index page but for which your webserver has directory listing enabled...
You must have a link somewhere that points to these a.htm, b.htm, c.htm files. I can't understand how phpDig would find them otherwise... I don't think it has a module for hacking into servers and reading filesystems.
10-09-2003, 11:16 PM | #9 | |
Green Mole
Join Date: Oct 2003
Location: Netherlands
Posts: 5
|
Quote:
Yesterday I set up a little experiment: I put a dummy file in one of those subdirectories, built a new index, and... it appeared in the index. So it looks like directory listing is indeed how PhpDig finds those unlinked files. The next thing I will try is simply uploading a blank index page; perhaps that will do. Or does anyone know how to disable directory listing when no index file is present?
__________________
-- Life is wasted on the living |
|
10-11-2003, 05:35 AM | #10 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Perhaps make a .htaccess file with the following line:
Code:
Options -Indexes
and stick the file in the directory to prevent directory listings when no index page is present.
__________________
Responses are offered on a voluntary, as-time-allows basis, with no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email. Thank you for your understanding. |
|
Similar Threads | ||||
Thread | Thread Starter | Forum | Replies | Last Post |
I cannot update my website | humanitaire.ws | How-to Forum | 7 | 01-19-2005 10:00 AM |
Update Problem | Siava | Troubleshooting | 1 | 07-27-2004 01:06 PM |
google update | heiko | IPs, SEs, & UAs | 4 | 04-17-2004 05:10 PM |
Update Index taking 11 hours.... | tester | Troubleshooting | 14 | 01-23-2004 11:10 AM |
Update Documentation | Charter | Feedback & News | 6 | 01-19-2004 11:11 AM |