[Discuss] Regular expressions and Boolean logic

Daniel M German dmgerman at uvic.ca
Mon Jul 16 17:32:00 PDT 2007


 > I have the following problem:
 > 
 > CMake has the ability to make source distributions from all files in the
 > source tree excluding those specified with a regular expression for the
 > complete path+filename.
 > 
 > Let's suppose that I want to put foo.pdf and bar.pdf in the distribution, but
 > exclude all other *.pdf files.  One way is to list out all
 > the (numerous) specific *.pdf files in the source tree that should
 > be excluded, but that is a pain.
 > 
 > Is there a slick way to do it with regular expression syntax?  In particular
 > I need a regular expression that is the following Boolean combination of RE's:
 > 
 > /.*\.pdf$/ AND NOT /.*foo\.pdf$/ AND NOT /.*bar\.pdf$/
 > 
 > Can such a Boolean combination be done with regular expressions?  Basically
 > the infix operator "|" is the Boolean OR for regular expressions, but what
 > is the equivalent regular expression (if any) for the Boolean AND and
 > Boolean NOT?
 > 

Hi Alan,

It is possible to build regular expressions by using Finite State
machines (for any regular expression there exists a FSM that accepts
the same language). This is kind of an ugly example, but it is
doable.

First you need to get rid of the "ANDs", using de Morgan Laws, so it
turns out to something like this:

NOT ( (NOT PDF) OR foo.pdf OR bar.pdf)

FSMs can be combined to accept the union of two languages (OR), and
can be negated (to accept the complement of a language, the NOT).

Building a FSM to accept foo.pdf and bar.pdf is trivial. Building a
FSM for NOT PDF is relatively trivial.

Now, combine all finite states machines into a finite state
machine. This is equivalent to the OR in between them.

To accomplish the NOT turn any accepting state into a non-accepting
state and vice-versa. Now you are almost done: you have
non-determinist FSM that accepts the language you wanted.

Then turn this FSM into a regular expression. This resulting
expression might not be trivial!


For instance, for the sake of simplicity let us assume you want a
REGEXP for NOT (foo.pdf | bar.pdf)

A FSM for (foo.pdf | bar.pdf) relatively trivial. The negation will
yield something like this (as regexp):

^(|[^fb].*|f[^o].*|fo[^o].*| ... |foo\.pd[^f]|
          |b[^a].*|ba[^r].*| ... |bar\.pd[^f]
 )$

The expression is long, but it is doable.

I suggest a simpler way to build it is by enumerating each file that
belongs to the set:

file1|file2|file3|...

dmg


 > Alan
 >> __________________________
 >> Alan W. Irwin
 >> 
 >> Astronomical research affiliation with Department of Physics and Astronomy,
 >> University of Victoria (astrowww.phys.uvic.ca).
 >> 
 >> Programming affiliations with the FreeEOS equation-of-state implementation
 >> for stellar interiors (freeeos.sf.net); PLplot scientific plotting software
 >> package (plplot.org); the libLASi project (unifont.org/lasi); the Loads of
 >> Linux Links project (loll.sf.net); and the Linux Brochure Project
 >> (lbproject.sf.net).
 >> __________________________
 >> 
 >> Linux-powered Science
 >> __________________________
 >> _______________________________________________
 >> Discuss mailing list
 >> Discuss at vlug.org
 >> http://ladybug.vlug.org/cgi-bin/mailman/listinfo/discuss
 >> 


-- 
--
Daniel M. German                  
http://turingmachine.org/
http://silvernegative.com/
dmg (at) uvic (dot) ca
replace (at) with @ and (dot) with .


More information about the Discuss mailing list