|
04-03-2004, 01:24 AM | #1 |
Green Mole
Join Date: Mar 2004
Posts: 7
|
Speed up spidering by skipping internal page links
As I explained in this thread, spidering can be very slow when pages contain many internal page links, such as the <A HREF="#1070721880">xxx</A> and <A NAME="1070721880"></A> pair. Since such links serve no purpose for the spider, I suggest skipping them.
Peter |
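To illustrate the suggestion above, here is a minimal sketch of link extraction that drops same-page links before they are queued. It is written in Python rather than taken from PhpDig's source, and the names `LinkCollector` and `extract_links` are hypothetical. It assumes that a link whose URL is empty once the fragment is removed points into the current page and can be skipped.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class LinkCollector(HTMLParser):
    """Collect href targets, ignoring links that only point inside the same page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> arrives as "a"/"href".
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return  # <A NAME="..."> targets carry no href, nothing to follow
        url, _fragment = urldefrag(href)
        if not url:
            return  # href was only "#something": an internal page link
        if url not in self.links:
            self.links.append(url)  # "page.html#a" and "page.html#b" collapse to one entry


def extract_links(html):
    """Return the distinct, fragment-free link targets found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links
```

With this approach a page holding thousands of `#...` links contributes no extra URLs to the crawl queue, although the page itself still has to be parsed.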
04-10-2004, 04:42 PM | #2 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Perhaps try removing the # symbol from the two pieces of code shown in this post.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
04-14-2004, 02:37 AM | #3 |
Green Mole
Join Date: Mar 2004
Posts: 7
|
Charter,
Thanks for the reply, but unfortunately it didn't help. Any other ideas? |
04-14-2004, 04:36 AM | #4 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Did you do a fresh index or a reindex?
|
04-15-2004, 02:54 AM | #5 |
Green Mole
Join Date: Mar 2004
Posts: 7
|
Both, with the same result.
|
04-15-2004, 03:45 PM | #6 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Keeping that removed # out, now try adding [^#]+
In the while loop of phpdigExplore: ([^#]+(([[a-z]{3,5}://
In the while loop of phpdigIndexFile: ([^#]+((http://
The <A NAME="1070721880"></A> shouldn't be matched regardless.
|
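The idea behind adding [^#]+ can be shown with a small stand-alone pattern. This is only a sketch in Python, not the actual expression used in phpdigExplore or phpdigIndexFile (those are only quoted in part above); `HREF_RE` and `hrefs_without_fragments` are hypothetical names. The assumption is that the href value must start with at least one character other than #, so a link like HREF="#1070721880" never matches, and capture stops at the # in a link like page.html#section.

```python
import re

# Hypothetical pattern: the captured href value may not contain '#', quotes, or '>'.
# Because the class needs at least one character, an href starting with '#' cannot match.
HREF_RE = re.compile(r'''<a\s[^>]*href\s*=\s*["']([^#"'>]+)''', re.IGNORECASE)


def hrefs_without_fragments(html):
    """Return href values up to (not including) any '#', skipping fragment-only links."""
    return HREF_RE.findall(html)
```

Note that a pattern like this only keeps such links out of the result list. The regex engine still has to scan every tag in the file, so it may not help much when the file itself is very large.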
04-18-2004, 08:17 AM | #7 |
Green Mole
Join Date: Mar 2004
Posts: 7
|
Charter,
It still doesn't do the job. The main reason is the size of the files I spider and the number of HTML tags they contain. Since I have a genealogy site, adding newly found ancestors and their descendants will only increase file sizes. For the time being I have chosen to generate my genealogy files in two flavours: one with complete functionality and one without any HTML tags. I use the latter for spidering and afterwards replace it with the full version. I believe a final solution might be to spider locally and then copy the local database to the database on the remote server, but this will need some investigation. For now, thanks for your help.
Peter |