Author Topic: The start of a regex library for BC  (Read 3529 times)

Offline MLeo

  • Retired Staff
  • Posts: 3636
  • Cookies: 833
  • Software Simian
    • the Programming Pantheon
The start of a regex library for BC
« on: February 29, 2008, 07:51:54 PM »
Warning, this post is intended for BC scripters who do a lot of text processing.

So, wouldn't it be nice to use regular expressions in BC?
Too bad you can't import the re module (since its executable counterpart wasn't included in BC).

I will need a regex library in BC in the future, so I took half a day (which ought to have been spent on some other pressing matters :oops:) venturing into the land of regex.
Now, nearly all regex libraries use the built-in regex library of Python, which we can't use.

This is only the start of a regex library; I hope to expand on it.

It's actually a port from this:
http://paste.lisp.org/display/24849
Which uses features not found in Python 1.5.2, so I had to port it to something BC can use.


So far, you can only match strings against a rule, and it will return an array with the "spill" (as strings, i.e. whatever is left of the input after the matched part). Or an empty array if nothing matches.
But even that might not be perfect (run the example and you'll see).

Most of the basic regex operators are present.
They are all in the form of functions (yeah, no fancy text parsing just yet).
So a regex like this:
c(a|d)+r becomes:
seq(char('c'), seq(plus(alt(char('a'),char('d'))), char('r')))
Brownie points for those who know what it matches. ;)
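For anyone wondering how functions like these can fit together, here is a minimal sketch. It is not the attached library's code: it is a from-scratch illustration written for a modern Python 3 console, and it assumes each matcher takes a string and returns a list of spills (an empty list meaning no match), as described above. The inner function names are made up for the example.

```python
# Hypothetical re-implementation for illustration only (Python 3).
# A matcher takes a string and returns a list of "spills": the text
# left over after each possible match. An empty list means no match.

def char(c):
    def match_char(text):
        # Consume one character if it equals c.
        if text[:1] == c:
            return [text[1:]]
        return []
    return match_char

def seq(first, second):
    def match_seq(text):
        # Run second on every spill that first leaves behind.
        spills = []
        for rest in first(text):
            spills.extend(second(rest))
        return spills
    return match_seq

def alt(left, right):
    def match_alt(text):
        # Either branch may match; keep the spills of both.
        return left(text) + right(text)
    return match_alt

def plus(expr):
    def match_plus(text):
        # One or more repetitions of expr.
        spills = []
        for rest in expr(text):
            spills.append(rest)
            if rest != text:  # guard against zero-width loops
                spills.extend(match_plus(rest))
        return spills
    return match_plus

# c(a|d)+r
cadr = seq(char('c'), seq(plus(alt(char('a'), char('d'))), char('r')))

print(cadr("cadr"))  # [''] : full match, nothing spilled
print(cadr("cdr!"))  # ['!'] : matched "cdr", spilled "!"
print(cadr("cr"))    # [] : no match, at least one a or d is required
```

Returning every possible spill (rather than just one) is what lets seq backtrack for free: it simply tries the second matcher on each leftover.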


I've added the {m} and {m,n} operators to the original work (repe).
See the "documentation" and the included example for more info.
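As a rough idea of what a {m} / {m,n} operator can look like in this style, here is a small sketch. The signature repe(expr, m, n) is a guess made for this example and may not match the attached library; the code is plain Python 3 and assumes matchers return a list of spills.

```python
# Hypothetical sketch of a bounded-repetition combinator (Python 3).
# The signature repe(expr, m, n=None) is assumed for illustration.

def char(c):
    def match_char(text):
        if text[:1] == c:
            return [text[1:]]
        return []
    return match_char

def repe(expr, m, n=None):
    if n is None:
        n = m  # {m} is the same as {m,m}
    def match_repe(text):
        # Zero repetitions count as a match only when m is 0.
        spills = [text] if m == 0 else []
        frontier = [text]
        for count in range(1, n + 1):
            # Apply expr once more to everything matched so far.
            next_frontier = []
            for rest in frontier:
                next_frontier.extend(expr(rest))
            frontier = next_frontier
            if count >= m:
                spills.extend(frontier)
        return spills
    return match_repe

print(repe(char('a'), 2)("aaab"))     # ['ab']
print(repe(char('a'), 1, 3)("aaab"))  # ['aab', 'ab', 'b']
print(repe(char('a'), 2)("ab"))       # []
```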

Before you ask, useful things like classes (don't be fooled by the ( ), they are currently only notational), their references ($#) and ranges ([]) aren't supported.

Which is part of what I want to add.
I intend to place it in scripts/utils/parsing/ since I plan to add at least one other type of parser to BC: xml (yes, with fancy util.parsing.xml.parse("file.xml").root_element.subelement being both a list of all those elements and the first instance of that element).

If you want to try it out in a normal Python console (yes, this is possible, it has no BC dependencies), then rename it to _regex.py (or something other than regex or re).
You can add another directory to the "classpath" (Python's module search path, sys.path) via:
Code: [Select]
import sys
sys.path.append("Drive:/path/to")
Then you can do import _regex (notice there isn't a trailing / on the path; this appears to be important).


If you have questions or suggestions, feel free to ask (a regex parser and class and range capabilities are already on my list, though feel free to add, and share, them yourself).
I still can't read peoples minds, nor can I read peoples computers, even worse, I can't combine the two to read what is going wrong with your BC install...

"It was filed under 'B' for blackmail." - Morse, Inspector Morse - The dead of Jericho.

Offline MLeo

Re: The start of a regex library for BC
« Reply #1 on: March 01, 2008, 12:06:24 PM »
And I already got you an update!

This time, you can do this:
Code: [Select]
from util.parsing.regex import match
for s in ["regex", "Regex", "qegex"]:
   if match(s, "(r|R)egex"):
      print "It's regex!"
   else:
      print "Not regex :("
This will print:
Code: [Select]
It's regex!
It's regex!
Not regex :(

You can still build the pattern by nesting the functions directly, as before.
This will produce the same output as above:
Code: [Select]
from util.parsing.regex import *
pred = longestMatch(seq(alt(char('r'), char('R')), seq(char('e'), seq(char('g'), seq(char('e'), char('x'))))))
for s in ["regex", "Regex", "qegex"]:
   if match(s, pred):
      print "It's regex!"
   else:
      print "Not regex :("

match returns 1 (or true) if it matches, 0 otherwise.
If you use a textual pattern, then it will always do a longest match (least spill).

I've also added 4 other helper functions:
longestMatch(expr) (least spill)
shortestMatch(expr) (most spill)
prettyPrint(expr [, indendationLevel]) (prints the expression out in a nice tree structure)
writeOut(expr) (prints a textual representation of an expression, i.e. as if you had entered the expression textually)
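To make "least spill" and "most spill" concrete, here is a small sketch. It assumes that both helpers wrap a matcher and return a new matcher yielding at most one spill; that return shape is an assumption, as is the stand-in matcher many_a, and the code is written for Python 3 rather than taken from the library.

```python
# Hypothetical sketch (Python 3): picking one result out of many spills.

def many_a(text):
    # Stand-in matcher for a+ : returns one spill per possible match length.
    spills = []
    while text[:1] == 'a':
        text = text[1:]
        spills.append(text)
    return spills

def longestMatch(expr):
    def match_longest(text):
        spills = expr(text)
        if not spills:
            return []
        # The longest match leaves the shortest leftover.
        return [min(spills, key=len)]
    return match_longest

def shortestMatch(expr):
    def match_shortest(text):
        spills = expr(text)
        if not spills:
            return []
        # The shortest match leaves the longest leftover.
        return [max(spills, key=len)]
    return match_shortest

print(many_a("aaab"))                 # ['aab', 'ab', 'b']
print(longestMatch(many_a)("aaab"))   # ['b']
print(shortestMatch(many_a)("aaab"))  # ['aab']
```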


Some other (minor) changes: I replaced the lambdas with inner functions, which makes debugging easier and makes prettyPrint and writeOut possible. Basically, each component (except those that work directly on strings: nil and iconcat) returns an inner function with its own identifying name. There is no (real) overhead either way (except a minor one where I added another level).

I've also added a standard predicate, whitespacePattern, which will match (with possible spill) if a string starts with whitespace.
I've included the normal characters that the string.py lists as being whitespace.
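A predicate like that can be sketched in a few lines. This version is an assumption-laden illustration in Python 3: it consumes exactly one leading whitespace character and returns the rest as spill, which may or may not be how the library's whitespacePattern behaves.

```python
# Hypothetical sketch (Python 3) of a whitespace predicate.
import string

def whitespacePattern(text):
    # Match when the first character is one that the string module
    # lists as whitespace; the remainder of the text is the spill.
    # The text[:1] check also keeps the empty string from matching.
    if text[:1] and text[0] in string.whitespace:
        return [text[1:]]
    return []

print(whitespacePattern(" abc"))   # ['abc']
print(whitespacePattern("\tabc"))  # ['abc']
print(whitespacePattern("abc"))    # []
```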

If there are any suggestions, then I'd love to hear about them.

Offline LJ

  • Retired Staff
  • Posts: 1661
  • Cookies: 1139
Re: The start of a regex library for BC
« Reply #2 on: March 01, 2008, 08:15:44 PM »
This is actually really really useful!  I wrote a very very simple chat bot a while back for the characters in Immersion, only to find out that because there was no regex library in BC it could not be used.

Keep up the great work!  :mrgreen:

Offline MLeo

Re: The start of a regex library for BC
« Reply #3 on: March 02, 2008, 07:41:44 AM »
Thanks!

I'm actually working on some other stuff that (ought) to be useful. But it requires some "sneaky" stuff. Like overriding reload, and making __getattr__, __setattr__ work via inheritance... >_>
Oh, and a "file" class, like the one built into later Pythons?

So, in a while, you can get a perfectly mirrored console (to an external file, after I finish both the file class and a pipe class), and you get to use the file "method" everywhere you want.


[EDIT] Regarding the __getattr__ stuff, it might not be as possible as I thought. :(
self.__dict__ _must_ be of a type dictionary, so no lookalikes.

Offline LJ

Re: The start of a regex library for BC
« Reply #4 on: March 02, 2008, 09:04:08 AM »

That all sounds great!!  BC really needs this.

Quote from: MLeo
I'm actually working on some other stuff that (ought) to be useful. But it requires some "sneaky" stuff. Like overriding reload, and making __getattr__, __setattr__ work via inheritance... >_>

What is wrong with the current implementation?  I use it to support flexible properties on objects in Immersion and it worked fine. :-)

Offline MLeo

Re: The start of a regex library for BC
« Reply #5 on: March 02, 2008, 02:21:33 PM »
Because I haven't been able to rely on it (consistency-wise). But that might have been a typo or something like that.

Offline LJ

Re: The start of a regex library for BC
« Reply #6 on: March 02, 2008, 02:47:37 PM »
Hmm, that's strange.  You have a copy of the latest core, don't you?  A working version is in that :-)

Offline MLeo

Re: The start of a regex library for BC
« Reply #7 on: March 02, 2008, 02:58:59 PM »
I have. But I think I just need to read the correct sections of the 1.5.2 manual (I suppose it doesn't help when I only have a copy of 2.4 :P).

Offline MLeo

Re: The start of a regex library for BC
« Reply #8 on: March 09, 2008, 07:57:27 PM »
An update to the regex library.
This time: ranges.
This is now valid:
Code: [Select]
[ac-mz]
Which will match:
a, c through m, and z

Still missing is the ^ "operator" in the range, which would only match if the character _isn't_ in the range.

The function is called matchRange, which takes a variable number of arguments (varargs), each either a single character (e.g. "a") or a tuple of 2 characters (e.g. ("c", "m")).
As an example:
Code: [Select]
matchRange("a", ("c", "m"), "z")
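For the curious, a range matcher with that calling convention could be sketched roughly like this. It is an illustration in Python 3 under the assumption that matchers return a list of spills; the library's real matchRange may be implemented differently.

```python
# Hypothetical sketch (Python 3) of a range matcher like [ac-mz].

def matchRange(*items):
    def match_range(text):
        head = text[:1]
        if not head:
            return []  # nothing left to match against
        for item in items:
            if isinstance(item, tuple):
                # A 2-tuple is an inclusive range of characters.
                low, high = item
                if low <= head <= high:
                    return [text[1:]]
            elif head == item:
                # A plain string is a single allowed character.
                return [text[1:]]
        return []
    return match_range

in_range = matchRange("a", ("c", "m"), "z")
print(in_range("a!"))  # ['!']
print(in_range("k9"))  # ['9']
print(in_range("b"))   # []
```

The missing ^ form would be the same loop with the result inverted: spill when no item matches, fail when one does.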

Right, now for things like $1 and such. And not to forget getting things out of this.
And maybe writing a re compatibility layer.



And, before I forget, I grabbed the Python 1.5.2 manual (library reference + language reference + module listing; this doesn't cover the changes the BC developers made to the Python version in BC) from its website (with wget, no less); zip attached below.

Offline MLeo

Re: The start of a regex library for BC
« Reply #9 on: March 11, 2008, 06:24:30 PM »
The next iteration, split and replace.
How deviously easy these were to implement.
Just 10 minutes, tops. Replace in particular was really easy to implement: just a reduce over the result of split.

split is called with input and a pattern.
replace is called with input, a pattern and the text to replace it with.

Now, why would I want to do that instead of using string.split and string.replace?
Because with these I can pass a regex as the "sep" parameter!
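Here is a rough sketch of the "replace is a reduce over split" idea. It is written for Python 3 (where reduce lives in functools) and is not the library's code: the separators matcher is a made-up stand-in for a parsed pattern like [,;]+, and split here always takes the longest separator match.

```python
# Hypothetical sketch (Python 3): split on a matcher, replace via reduce.
from functools import reduce

def separators(text):
    # Stand-in for a pattern like [,;]+ : one spill per match length.
    spills = []
    while text[:1] in (",", ";"):
        text = text[1:]
        spills.append(text)
    return spills

def split(text, pattern):
    parts = []
    current = ""
    while text:
        # Only accept matches that actually consume something.
        spills = [s for s in pattern(text) if len(s) < len(text)]
        if spills:
            parts.append(current)
            current = ""
            text = min(spills, key=len)  # longest match, least spill
        else:
            current = current + text[0]
            text = text[1:]
    parts.append(current)
    return parts

def replace(text, pattern, replacement):
    parts = split(text, pattern)
    # Glue the pieces back together with the replacement in between.
    return reduce(lambda acc, part: acc + replacement + part,
                  parts[1:], parts[0])

print(split("a,b;;c", separators))           # ['a', 'b', 'c']
print(replace("a,b;;c", separators, " | "))  # a | b | c
```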


Some other improvements (these took some more time ;)): you can now put an operator after any character, and most likely also after another operator, but that results in a stack overflow. :S So don't try this at home:
regex.parse("a?+")
Well, you can easily do that, just don't use it.

Anyway, here is the current regex library for BC.

Offline MLeo

Re: The start of a regex library for BC
« Reply #10 on: March 13, 2008, 07:52:30 PM »
A new, and most certainly rough, addition.

Bindings!
Every ( ) becomes a bound "variable". And you can reference it (always after the fact!) through $#, where # is the "index" number of the bind, starting at 0.

So:
Code: [Select]
(abc|xyz)mn$0
Will match:
Code: [Select]
abcmnabc
and:
Code: [Select]
xyzmnxyz
But not:
Code: [Select]
xyzmnabc
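One way to picture how a bind and its $# reference can work is sketched below. This is an illustration in Python 3, not the library's implementation: the names lit, bind and ref are invented for the example, and each matcher here takes the text plus a tuple of binds and returns (spill, binds) pairs.

```python
# Hypothetical sketch (Python 3) of bindings and back-references.
# Matchers take (text, binds) and return a list of (spill, binds) pairs.

def lit(s):
    def match_lit(text, binds):
        if text.startswith(s):
            return [(text[len(s):], binds)]
        return []
    return match_lit

def alt(left, right):
    def match_alt(text, binds):
        return left(text, binds) + right(text, binds)
    return match_alt

def seq(*exprs):
    def match_seq(text, binds):
        states = [(text, binds)]
        for expr in exprs:
            # Feed every surviving state into the next matcher.
            next_states = []
            for rest, bound in states:
                next_states.extend(expr(rest, bound))
            states = next_states
        return states
    return match_seq

def bind(expr):
    def match_bind(text, binds):
        results = []
        for rest, bound in expr(text, binds):
            # Remember exactly what this group consumed.
            consumed = text[:len(text) - len(rest)]
            results.append((rest, bound + (consumed,)))
        return results
    return match_bind

def ref(index):
    def match_ref(text, binds):
        # $index must match the same text the bind consumed earlier.
        if index < len(binds):
            return lit(binds[index])(text, binds)
        return []
    return match_ref

# (abc|xyz)mn$0
pattern = seq(bind(alt(lit("abc"), lit("xyz"))), lit("mn"), ref(0))

def matches(text):
    # A full match is any result with nothing spilled.
    return any(rest == "" for rest, _ in pattern(text, ()))

print(matches("abcmnabc"))  # True
print(matches("xyzmnxyz"))  # True
print(matches("xyzmnabc"))  # False
```

Note that this sketch numbers binds in the order they finish matching, so nested groups would be indexed by their closing parenthesis rather than their opening one.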
You can even get stuff out of it.
If you pass a reference to a list (anything with append and the subscript operator for integers) as a third argument, then it will be filled with the matches. Each bind is a list of matches. This was needed because the code returns multiple results for some operators, which is also why the longestMatch function was introduced. So, be aware of this. Generally speaking, it's the first element of the list that is the closest (longest) match.

Nested binds haven't been tested yet.

I have also included 2 functions: one that returns a single bind (getBound(input, pattern, index)) and one that returns all binds (getBounds(input, pattern)).
But if you call match(input, pattern, alistref), or do p = regex.parse("a pattern") and then p("input", alistref), it will also work.


I also fixed a bug in the ranges, where it really wasn't working properly.


So, for now (I have a bed with my name on it), I'll leave you with the current (rough) version of the regex library for BC.


So please, leave comments on how to improve this!
I'm especially interested in the syntax of the parse function, and how the function names could be improved (or the writeback through writeOut/prettyPrint).