Synthesizing Regular Expressions from Examples

It was rather surprising that such a simple approach can go this far, and this case study should be shared in the community. – Reviewer 1


You can find relevant materials here:

  • Publication [pdf]
  • Demonstration [web]
  • Project Proposal [pdf]
  • Poster
  • Code [web]
AlphaRegex

AlphaRegex: Automatic Synthesizer for Regular Expressions from Examples


Background


For those who do not have background in computer science, I wrote brief descriptions for a few terms to help you understand the contents.


Regular expressions?

A regular expression is a way to express a text pattern.

*.txt

For instance, *.txt represents all text files. A wildcard notation * is used for any sequence of characters; hence, any files with the extension .txt regardless of their names are selected by the regular expression above. You may use it to find all text files in certain directories.

Due to the expressiveness of regular expressions, they are widely used for various text manipulation tasks such as text extraction, classification, etc.


Introductory automata assignments?

Regular expressions are taught in many computer science courses such as automata theory.

a, b, c, … → 0, 1

In practice, all available characters more or less are used to construct regular expressions. However, in automata classes, students who learn regular expressions for the first time usually only deal with the binary alphabet (i.e. 0 and 1) for simplicity.


Program synthesis?

Program synthesis is the task of automatically generating a program that meets user intent.

More details available here.


Motivation


This is a sample question in assignments for introductory automata courses.

Find a regular expression for the following language.
L = {w ∈ {0, 1}* | Strings have exactly one pair of consecutive 0s}

boy-asking-question-1

For students, they may not understand the meaning of the problem, do not know where to start, or give up after a few trials, not to mention they may come up with wrong answers.

Even for instructors, it takes a while to solve the problem. Moreover, they may wonder whether their regular expression is an optimal solution.


Use of examples

Case 1. When students do not understand the problem, instructors often use examples to clarify the meaning of the problem:

Student: I don’t understand what ‘one pair of consecutive 0s’ in the question means.

Instructor: It means ‘00‘ appears only one time in each string represented by the regular expression. For instance, correct strings would be ‘00‘, ‘1001′, ’010010′ and so on, whereas ’01’, ’11’, and ‘000’ would be wrong examples of the strings.

Case 2. Students often use examples and counterexamples to check whether their answers are correct.


Problem Statement


In lieu of having instructors demonstrating the meaning of problems to each student or wondering if their answers are optimal ones, let’s come up with:

A method for synthesizing regular expressions
from a set of simple, basic examples given by users.

00, 1001, 010010 (O)
01, 11, 000 (X)

(0?1)*00(10?)*

In a nutshell, while regular expressions are widely used for lots of text applications, it is still hard for first-timers to construct them. Our method automatically finds an optimal regular expression from a set of examples they provide.


Problem Formalization


Syntax

e → a ∈ Σ | ε | ∅ | e + e | e • e | e*


Regular expression problems

Students are given with a description of a regular language L. We assume that the description of the language is given by a pair (P, N) of example strings.

Positive examples: P ⊆ Σ*
Negative examples: N ⊆ Σ*

Positive examples are correct strings that must be included in the language, and negative examples are wrong strings that must be excluded from the language.

A regular expression problem: (P, N)


Technique


Basic search algorithm

Our algorithm checks all regular expressions in the order of their simplicity until it finds a solution.

screen-shot-2016-06-06-at-2-36-51-pm

Starting from the initial (i.e. simplest) regular expression, it iteratively constructs regular expressions. Because this approach results in a huge search space, we need to effectively prune out the search tree to find a solution in a reasonable time.


Pruning the search space

screen-shot-2016-06-06-at-2-40-40-pm

  1. Prune when further search does not produce any results
  2. Prune when the current candidate is subsumed by a simpler candidate

(Please check the paper for more details!)

One Comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s