Zabaware Support Forums

Messages - Lee

#1
Ultra Hal 7.0 / Let Hal Learn to its fullest concept!
February 25, 2009, 02:45:03 PM
Here goes my first post on this forum.  I was curious whether any of you are using regular expressions to filter out some of the noise from Wikipedia. I'm having some success cleaning up the Wiki verbiage.
The function below returns True if the line it is given matches any pattern that I deem to be noise. To add a pattern, raise the array bound by one and assign the new element.

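Function Noisy(strLine)
   Dim re
   Dim PatArray(7)
   Dim InNdx

   ' Each pattern marks a line as noise. To add one, raise the bound
   ' in the Dim above and assign another element.
   PatArray(0) = "http:"               ' any line containing a URL
   PatArray(1) = "^\^ [\w]*"           ' footnote lines that start with "^ "
   PatArray(2) = "\[edit\] [\w]*"      ' section "[edit]" links
   PatArray(3) = "^\d [\w]*"           ' lines that start with a digit and a space
   PatArray(4) = "Categories: [\w]*"
   PatArray(5) = "[\w]*wikipedia"
   PatArray(6) = "See also:"
   PatArray(7) = "Main article:"

   Set re = New RegExp
   re.MultiLine = True     ' let ^ match at the start of every line in strLine
   re.IgnoreCase = True
   For InNdx = LBound(PatArray) To UBound(PatArray)
      re.Pattern = PatArray(InNdx)
      ' Test is enough here: we only need to know whether a match exists
      If re.Test(strLine) Then
         Noisy = True
         Exit Function
      End If
   Next
   Noisy = False
End Function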
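
For anyone who wants to try it, here is a rough sketch of how the function might be called on a whole block of Wikipedia text. StripNoise is a hypothetical helper name of mine, not something built into Hal, and it assumes Noisy() is defined in the same script. It splits the text into lines, drops blank lines and anything Noisy() flags, and joins the rest back together.

```vbscript
' Hypothetical helper (an illustration only, not part of Ultra Hal):
' returns strText with blank lines and noisy lines removed.
' Assumes the Noisy() function is available in the same script.
Function StripNoise(strText)
   Dim arrLines, strLine, strOut
   strOut = ""
   ' Normalize Windows line endings, then split into individual lines
   arrLines = Split(Replace(strText, vbCrLf, vbLf), vbLf)
   For Each strLine In arrLines
      ' VBScript does not short-circuit And, so the checks are nested
      If Len(Trim(strLine)) > 0 Then
         If Not Noisy(strLine) Then
            strOut = strOut & strLine & vbCrLf
         End If
      End If
   Next
   StripNoise = strOut
End Function
```

Checking one line at a time keeps each pattern's ^ anchor meaningful and makes it easy to see which line tripped which pattern while you are tuning the list.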