Ultra Hal 7.0 / Let Hal Learn to its fullest concept!
« on: February 25, 2009, 02:45:03 pm »
Here goes my first post on this forum. I was curious if any of you are using regular expressions to filter out some of the noise from Wikipedia. I'm having some success in cleaning up the Wiki verbiage.
The function returns true if certain patterns that I deem to be noise are discovered. One can add patterns as easily as adding elements to the array.
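Function Noisy(strLine)
    ' Returns True when strLine matches any of the noise patterns below.
    ' The backslashes appear to have been stripped when this was posted, so the
    ' escapes (\w, \d, \[, \], \^) are restored here on that assumption.
    Dim re, InNdx
    Dim PatArray(7)   ' upper bound 7 = 8 elements; raise it when adding patterns
    ' Originally "*http:"; a leading * is not a valid quantifier, so this
    ' assumes any line containing a URL counts as noise.
    PatArray(0) = "http:"
    PatArray(1) = "^\^ [\w]*"          ' footnote back-links starting with a caret
    PatArray(2) = "\[edit\] [\w]*"     ' section edit links
    PatArray(3) = "^\d [\w]*"          ' lines starting with a digit and a space
    PatArray(4) = "Categories: [\w]*"
    PatArray(5) = "[\w]*wikipedia"
    PatArray(6) = "See also:"
    PatArray(7) = "Main article:"
    Set re = New RegExp
    re.MultiLine = True
    re.IgnoreCase = True
    Noisy = False
    For InNdx = LBound(PatArray) To UBound(PatArray)
        re.Pattern = PatArray(InNdx)
        ' Test is enough here: we only need to know whether a match exists.
        If re.Test(strLine) Then
            Noisy = True
            Exit Function
        End If
    Next
End Function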
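For anyone who wants to try the idea outside Hal, here is a minimal sketch of running such a filter over a block of text line by line. The helper names MatchesAny and StripNoise, the sample text, and the shortened patterns are my own assumptions for illustration, not part of Hal's script. It uses WScript.Echo, so it is meant to be run with cscript rather than pasted into the brain script. Passing the patterns in with Array(...) avoids having to bump a Dim bound each time one is added.

```vbscript
Option Explicit

' Hypothetical helper: True when strLine matches any pattern in arrPatterns.
Function MatchesAny(strLine, arrPatterns)
    Dim re, i
    Set re = New RegExp
    re.IgnoreCase = True
    re.MultiLine = True
    MatchesAny = False
    For i = LBound(arrPatterns) To UBound(arrPatterns)
        re.Pattern = arrPatterns(i)
        If re.Test(strLine) Then
            MatchesAny = True
            Exit Function
        End If
    Next
End Function

' Hypothetical helper: returns strText with every noisy line dropped.
Function StripNoise(strText, arrPatterns)
    Dim arrLines, strLine, strKept, blnFirst
    ' Normalise line endings, then split into individual lines.
    arrLines = Split(Replace(strText, vbCrLf, vbLf), vbLf)
    strKept = ""
    blnFirst = True
    For Each strLine In arrLines
        If Not MatchesAny(strLine, arrPatterns) Then
            If Not blnFirst Then strKept = strKept & vbLf
            strKept = strKept & strLine
            blnFirst = False
        End If
    Next
    StripNoise = strKept
End Function

Dim arrNoise, strSample
' Assumed pattern list, shortened from the function above.
arrNoise = Array("http:", "^\^ ", "\[edit\]", "^\d ", _
                 "Categories:", "wikipedia", "See also:", "Main article:")

' Made-up sample text: two clean sentences around two noisy lines.
strSample = "Hal is a chatbot." & vbLf & _
            "[edit] History" & vbLf & _
            "See also: Eliza" & vbLf & _
            "It learns from conversation."

WScript.Echo StripNoise(strSample, arrNoise)
```

With the sample above, the "[edit]" and "See also:" lines are dropped and only the two ordinary sentences are echoed.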