Messages - Lee

Ultra Hal 7.0 / Let Hal Learn to its fullest concept!
« on: February 25, 2009, 02:45:03 pm »
Here goes my first post on this forum. I was curious whether any of you are using regular expressions to filter out some of the noise from Wikipedia. I'm having some success in cleaning up the Wiki verbiage.
The function below returns True if a line matches any of the patterns that I deem to be noise. Adding a pattern is as easy as adding an element to the array (remember to raise the array's upper bound to match).

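Function Noisy(strLine)
    Dim re
    Dim PatArray(7)
    Dim InNdx

    ' Each pattern marks a line as Wikipedia noise.
    PatArray(0) = "http:"                ' any line containing http: (a pattern cannot start with a bare *)
    PatArray(1) = "^\^ [\w]*"            ' footnote back-links that start with a caret
    PatArray(2) = "\[edit\] [\w]*"       ' section [edit] links
    PatArray(3) = "^\d [\w]*"            ' lines that start with a digit and a space
    PatArray(4) = "Categories: [\w]*"    ' category footer
    PatArray(5) = "[\w]*wikipedia"       ' any mention of Wikipedia
    PatArray(6) = "See also:"
    PatArray(7) = "Main article:"

    ' The options are the same for every pattern, so set them once.
    Set re = New RegExp
    re.MultiLine = True
    re.IgnoreCase = True

    Noisy = False
    For InNdx = LBound(PatArray) To UBound(PatArray)
        re.Pattern = PatArray(InNdx)
        ' Test stops at the first match, which is all we need here.
        If re.Test(strLine) Then
            Noisy = True
            Exit Function
        End If
    Next
End Function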
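
For anyone who wants to try it, here is a minimal sketch of how Noisy might be called to clean a block of text line by line. CleanText is a hypothetical helper name, not part of Hal, and the sketch assumes Noisy is defined in the same script and that its patterns compile as intended.

```vbscript
' Hypothetical helper: returns strText with every noisy line dropped.
' Assumes Noisy() from this post is defined in the same script.
Function CleanText(strText)
    Dim arrLines, strLine, strOut
    strOut = ""
    ' Normalise Windows line endings, then split into individual lines.
    arrLines = Split(Replace(strText, vbCrLf, vbLf), vbLf)
    For Each strLine In arrLines
        ' Keep only the lines that no noise pattern matches.
        If Not Noisy(strLine) Then
            strOut = strOut & strLine & vbCrLf
        End If
    Next
    CleanText = strOut
End Function
```

Filtering one line at a time keeps the ^ anchors meaningful, since each call to Noisy then sees a single line rather than a whole article.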
