PhpDig.net

Go Back   PhpDig.net > PhpDig Forums > Mod Submissions

Reply
 
Thread Tools
Old 02-01-2006, 07:58 AM   #1
JCL
Green Mole
 
Join Date: Feb 2006
Location: FRANCE
Posts: 1
Indexing DOC with VBA

Hi all,

Here is my projet:
- Intranet with more than 200000 WORD documents.
- PHP interface (Easy-PHP) + phpDIG

DESCRIPTION

Problems:
1- Catdoc doesn't work fine with me, especially with various Word formats and accents
2- I do not index a web site but also a directory subtree. Each subdir contains 3000 files.

Solutions:
1a- I use Word instead of Catdoc to extract the text objects and store them into text files, with a VBA script.
1b- I had to modify the code to replace file.txt by .doc with the links
1c- the VBA script generates an index.html file linking all text extractions used by the spider.

2- I had to modify some constants to make the spider accept a number of files in Level 1 more than 200000

Performance:
1- the VBA text extractor takes roughly 2s per file
2- I used a 2x512Mo RAM to index the base. It took me more than 250 work hours.

The indexing time results are:
File number / Time (s)
1000 0,5
100000 4
150000 6
200000 10

THE VBA SCRIPT

Works better than catdoc but I don't warranty it for your purpose !

The aim of this script is to extract text objects from DOC files. Each directory contains a certain number of files. For instance, you can have "C:\DATA\shelf01, C:\DATA\shelf02, ..." with every directory containing 3000 files.
For each DOC file is created a TXT file.
In each directory is created a index.html file containing links to the extractions.

Microsoft does not recommend the use of Office to automate processes, because of the conversational requests (a dialog box asking you for something). You have to look at you computer to watch if it is suspended. In my db it appended twice. If it occurs, just have to shut down WORD by clicking on the cross.

The script takes care of password protected and rights of the files. Any protected file will not be indexed.

The script is partially fault tolerant, i.e., if the program crashes, you just have to relaunch it. It will start at the beginning of the current directory.


Best Regards,

Jean-Christophe LECOQ


Code:
Sub GenerateINDEXHTML()

Dim ShelfList
Dim currentDir As String, currentWORDFile As String, currentTXTFile As String, lastFileProcessed As String
Dim excludedDir As Variant, excludedFiles As Variant
Dim tempres As String
Dim FTPBasePath As String, LogFileName As String, LogErrorFileName As String, CurrentProcessFile As String
Dim ExcludeDocFileName As String
Dim arrFiles As Variant, arrDir As Variant
Dim NumFiles As Long, NumDir As Long
Dim CurrentOpenedFileID As Long, IndexFileID As Long, LogFileID As Long
Dim LogErrorFileID As Long, CurrentProcessFileID As Long, ExcludeDocFileID As Long
Dim TotaDirToExclude As Long, TotalFileToExclude As Long
Dim i, j, k
Dim reloadWORDmax As Long, reloadWORDcounter As Long

Dim dirToDo As Boolean, canProcessFile As Boolean
Dim debugflag As Boolean
Dim ProcessNow As Boolean
Dim oApp As Object
Dim oWord As Object, oTxt As Object


' Where the files are supposed to be stored
FTPBasePath = "c:\FTP\documents\"

' Log file name
LogFileName = "C:\FTP\conv_dir.log"

' Log Error file name
LogErrorFileName = "C:\FTP\conv_err.log"

' Log the current file that is processed
CurrentProcessFile = "C:\FTP\conv_cur.log"

' where the file to exclude of the process are
ExcludeDocFileName = "C:\FTP\conv_exclude.log"

' debug flag
debugflag = True

' reload winWORd every x documents
reloadWORDmax = 200
reloadWORDcounter = 0

NumDir = GesDirList(FTPBasePath, arrDir)

' open and read the list of files to exclude
TotalFileToExclude = 0
If ("" <> Dir(ExcludeDocFileName)) Then
    ExcludeDocFileID = FreeFile
    Open ExcludeDocFileName For Input As #ExcludeDocFileID  ' should test if the file exist
    ReDim excludedFiles(0 To 1000)  ' should redim if exceeding
    Debug.Print "Building excluded file list..."
    Do While Not EOF(ExcludeDocFileID)
        Input #ExcludeDocFileID, excludedFiles(TotalFileToExclude)
        TotalFileToExclude = TotalFileToExclude + 1
    Loop
    ReDim Preserve excludedFiles(0 To TotalFileToExclude)
    Close #ExcludeDocFileID
End If

' check if there is already a log file
TotaDirToExclude = 0
LogFileID = FreeFile
If ("" = Dir(LogFileName)) Then
    Open LogFileName For Output As #LogFileID
Else
    ' open the log file and read it to get directories
    Open LogFileName For Input As #LogFileID
    ReDim excludedDir(0 To 100)
    Debug.Print "Retrieving log file..."
    Do While Not EOF(LogFileID)
        Input #LogFileID, currentDir
        excludedDir(TotaDirToExclude) = currentDir
        TotaDirToExclude = TotaDirToExclude + 1
    Loop
    ReDim Preserve excludedDir(0 To TotaDirToExclude)
    Close #LogFileID
End If

' read the last file processed
If (True = debugflag) Then
    ' set the file in process
    Debug.Print "Process is set in debug mode."
    
    If ("" <> Dir(CurrentProcessFile)) Then
        CurrentProcessFileID = FreeFile
        Open CurrentProcessFile For Input As #CurrentProcessFileID
        Input #CurrentProcessFileID, lastFileProcessed
        Close #CurrentProcessFileID
        ProcessNow = False
    Else
        Debug.Print "No last file processed."
        ProcessNow = True
    End If
Else
    ' process file in any case
    ProcessNow = True
End If

Debug.Print "Let's go..."

' list of the shelfs
For i = 1 To NumDir
    
    currentDir = FTPBasePath & arrDir(i)
    Debug.Print "Processing " & currentDir
    
    ' is the directory excluded ?
    dirToDo = True
    If (0 <> TotaDirToExclude) Then
        For k = 0 To (TotaDirToExclude)
            If (0 <> InStr(excludedDir(k), currentDir)) Then
                Debug.Print "directory already done ! skipping..."
                dirToDo = False 
                Exit For
            End If
        Next
    End If
    
    If (True = dirToDo) Then
        NumFiles = GetFileList(currentDir, "*.doc", arrFiles)
        
        ' generate a new text file for index.html
        fileName = currentDir & "\index.html"
        IndexFileID = FreeFile
        Open fileName For Output As #IndexFileID
        
        Print #IndexFileID, "<!DOCTYPE HTML PUBLIC>"
        Print #IndexFileID, "<html><head><title>Liste</title></head>"
        Print #IndexFileID, "<BODY><TABLE>"

        If (NumFiles > 0) Then
            For j = 1 To NumFiles
                ' stopping the current WORD session if needed
                If (reloadWORDcounter >= reloadWORDmax) Then
                    ' quit WORD, without changing anything
                    oApp.Quit SaveChanges:=wdDoNotSaveChanges
                    Set oApp = Nothing
                    reloadWORDcounter = 0
                End If
                
                ' starting a new WORD session
                If (reloadWORDcounter = 0) Then
                    Set oApp = CreateObject("Word.Application")
                    oApp.Visible = False
                    oApp.DisplayAlerts = False
                    'disable spell and grammar checking
                    oApp.Options.CheckGrammarAsYouType = False
                    oApp.Options.CheckSpellingAsYouType = False
                    oApp.Options.CheckGrammarWithSpelling = False
                    ' do not update OLE links
                    oApp.Options.UpdateLinksAtOpen = False
                End If
                
                reloadWORDcounter = reloadWORDcounter + 1
                
                ' name of the word file
                currentWORDFile = currentDir & "\" & arrFiles(j)
                              
                ' build a txt equivalent file
                If (0 <> InStr(Left(Right(currentWORDFile, 4), 3), ".")) Then
                    currentTXTFile = LCase(Left(currentWORDFile, Len(currentWORDFile) - 4)) & ".txt"
                End If
                tempres = currentTXTFile
                Do
                    k = InStr(tempres, "\")
                    If (0 <> k) Then
                        tempres = Right(tempres, Len(tempres) - k)
                    End If
                Loop Until (0 = k)
                
                canProcessFile = True
                Err.Clear
                
                ' is the file excluded ?
                If (TotalFileToExclude > 0) Then
                    For k = 0 To (TotalFileToExclude)
                        If (0 <> InStr(excludedFiles(k), currentWORDFile)) Then
                            Debug.Print "the file " & currentWORDFile & " is excluded ! skipping..."
                            canProcessFile = False
                            Exit For
                        End If
                    Next
                End If
                
                ProcessNow = True
                
                ' can the process restart ?
                If (True = debugflag) Then
                    If (False = ProcessNow) Then
                        If (Len(lastFileProcessed) > 0) Then
                            ' is it the current file ?
                            If (0 <> InStr(lastFileProcessed, currentWORDFile)) Then
                                Debug.Print "Restarting process from " & currentWORDFile
                                ProcessNow = True
                            End If
                        Else
                            Debug.Print "No crash before, starting now."
                            ProcessNow = True
                        End If
                    Else
                        ' store the name of the current file in case of crash
                        CurrentProcessFileID = FreeFile
                        Open CurrentProcessFile For Output As #CurrentProcessFileID
                        Print #CurrentProcessFileID, currentWORDFile
                        Close #CurrentProcessFileID
                    End If
                End If
                
                If ((True = canProcessFile) And (True = ProcessNow)) Then
                    On Error GoTo OpenErrorHandler
                    Set oWord = oApp.Documents.Open(fileName:=currentWORDFile, ConfirmConversions:=False, _
                                                    AddToRecentFiles:=False, ReadOnly:=True, _
                                                    PasswordDocument:="JCisthebest")
                    
                    If (True = canProcessFile) Then
                        oWord.ShowGrammaticalErrors = False
                        CurrentOpenedFileID = FreeFile
                        Open currentTXTFile For Output As #CurrentOpenedFileID
                        Close #CurrentOpenedFileID
                        Set oTxt = oApp.Documents.Open(fileName:=currentTXTFile, AddToRecentFiles:=False)
                        oTxt.ShowGrammaticalErrors = False
                        oTxt.range.Text = oWord.range.Text
                        If (True = canProcessFile) Then
                            oApp.ActiveDocument.Save
                        End If 
                        
                        oWord.Close SaveChanges:=wdDoNotSaveChanges
                        oTxt.Close
                        Set oWord = Nothing
                        Set oTxt = Nothing
                        
                    End If 'canProcessFile (reading a file generates problem)
                End If 'canProcessFile (file excluded)
                        
                If (True = canProcessFile) Then
                     Print #IndexFileID, "<TR><TD><A href=" & tempres & _
                            ">" & arrFiles(j) & " converti en txt " & "</A></TD></TR>"
                End If 
                
            Next
        
        End If 
        
        Print #IndexFileID, "</TABLE></BODY></html>"
        Close #IndexFileID
        
        ' open log file in an append mode, to add a new shelf
        LogFileID = FreeFile
        Open LogFileName For Append As #LogFileID
        Print #LogFileID, currentDir
        Close #LogFileID
        
    End If 
Next

oApp.Quit
Set oApp = Nothing

Debug.Print "OVER !"
Exit Sub

OpenErrorHandler:
                ' append error in log file
                LogErrorFileID = FreeFile
                Open LogErrorFileName For Append As #LogErrorFileID
                Print #LogErrorFileID, Time & ": Error " & Err.Number & " on " & currentWORDFile
                canProcessFile = False
                Close #LogErrorFileID
                If (Err.Number = -2147023170) Then
                    ' critical error, relaod WINWORD
                    Debug.Print "critical error ! reloading WINWORD..."
                    reloadWORDcounter = reloadWORDmax
                End If
                Resume Next
End Sub

Function GesDirList(ByVal sPath As String, arrFiles) As Long
Dim NextReDim As Long, nFound As Long
Dim currentRep
NextReDim = 10
nFound = 0

ReDim arrFiles(0 To NextReDim)
If Right(sPath, 1) <> "\" Then sPath = sPath & "\"

currentRep = Dir(sPath, vbDirectory)
Do While currentRep <> ""
    If currentRep <> "." And currentRep <> ".." Then
        If (GetAttr(sPath & currentRep) And vbDirectory) = vbDirectory Then
            nFound = nFound + 1
            arrFiles(nFound) = currentRep
            If nFound >= NextReDim Then
                NextReDim = NextReDim + NextReDim
                ReDim Preserve arrFiles(0 To NextReDim)
            End If
        End If
    End If
    currentRep = Dir()
Loop
ReDim Preserve arrFiles(0 To nFound)
GesDirList = nFound
End Function

Function GetFileList(ByVal sPath As String, SearchStr As String, arrFiles) As Long
    Dim fileName As String
    Dim NextReDim As Long, nFound As Long, bAdd As Boolean
    Dim LCaseSearchStr As String
    LCaseSearchStr = LCase(SearchStr)
    NextReDim = 100
    
    ReDim arrFiles(0 To NextReDim)
    If Right(sPath, 1) <> "\" Then sPath = sPath & "\"
    nFound = 0
    fileName = Dir(sPath & SearchStr)
    Do While Len(fileName) > 0
        Select Case True
            Case fileName = "."
            Case fileName = ".."
            Case Else
                nFound = nFound + 1
                arrFiles(nFound) = fileName
                If nFound >= NextReDim Then
                    NextReDim = NextReDim + 100
                    ReDim Preserve arrFiles(0 To NextReDim)
                End If
        End Select
        fileName = Dir()
    Loop
    ReDim Preserve arrFiles(0 To nFound)
    GetFileList = nFound
End Function
JCL is offline   Reply With Quote
Reply


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Forum Jump

Similar Threads
Thread Thread Starter Forum Replies Last Post
index only *.doc files ? ipguy Troubleshooting 1 01-16-2006 04:45 PM
indexing for the 1st time but getting "duplicate of existing doc" msg with some files Morphea Troubleshooting 9 12-30-2004 04:03 PM
problem with .pdf and .doc files mleray External Binaries 11 12-09-2004 11:26 PM
No short description from .doc? Spider External Binaries 1 09-06-2004 03:25 AM
For dummies: How can i index word doc?? dapuse External Binaries 3 01-27-2004 03:09 PM


All times are GMT -8. The time now is 08:32 PM.


Powered by vBulletin® Version 3.7.3
Copyright ©2000 - 2024, Jelsoft Enterprises Ltd.
Copyright © 2001 - 2005, ThinkDing LLC. All Rights Reserved.