# Generating Regular Expressions from Examples

It was rather surprising that such a simple approach can go this far, and this case study should be shared in the community. – Reviewer 1

Synthesizing Regular Expressions from Examples for Introductory Automata Assignments
Mina Lee*, Sunbeom So*, Hakjoo Oh
GPCE 2016
Best Paper Award
paper · poster (english) · poster (korean) · code
proposal (for senior project)

## Background

Regular expressions?

A regular expression is a way to express a text pattern.

# *.txt

For instance, *.txt represents all text files. A wildcard notation * is used for any sequence of characters; hence, any files with the extension .txt regardless of their names are selected by the regular expression above. You may use it to find all text files in certain directories.

Due to the expressiveness of regular expressions, they are widely used for various text manipulation tasks such as text extraction, classification, etc.

Introductory automata assignments?

Regular expressions are taught in many computer science courses such as automata theory.

## a, b, c, … → 0, 1

In practice, all available characters more or less are used to construct regular expressions. However, in automata classes, students who learn regular expressions for the first time usually only deal with the binary alphabet (i.e. 0 and 1) for simplicity.

Program synthesis?

Program synthesis is the task of automatically generating a program that meets user intent.

## Motivation

This is a sample question in assignments for introductory automata courses.

Find a regular expression for the following language.
L = {w ∈ {0, 1}* | Strings have exactly one pair of consecutive 0s}

For students, they may not understand the meaning of the problem, do not know where to start, or give up after a few trials, not to mention they may come up with wrong answers.

Even for instructors, it takes a while to solve the problem. Moreover, they may wonder whether their regular expression is an optimal solution.

Scenario 1. When students do not understand the problem, instructors often use examples to clarify the meaning of the problem:

Student: I don’t understand what ‘one pair of consecutive 0s’ in the question means.

Instructor: It means ‘00‘ appears only one time in each string represented by the regular expression. For instance, correct strings would be ‘00‘, ‘1001′, ’010010′ and so on, whereas ’01’, ’11’, and ‘000’ would be wrong examples of the strings.

Scenario 2. Students often use examples and counterexamples to check whether their answers are correct.

## Problem Statement

In lieu of having instructors demonstrating the meaning of problems to each student or wondering if their answers are optimal ones, let’s come up with:

A method for synthesizing regular expressions
from a set of simple, basic examples given by users.

### 00, 1001, 010010 (O) 01, 11, 000 (X) ⇓ (0?1)*00(10?)*

In a nutshell, while regular expressions are widely used for lots of text applications, it is still hard for first-timers to construct them. Our method automatically finds an optimal regular expression from a set of examples they provide.

## Problem Formalization

Syntax

e → a ∈ Σ | ε | ∅ | e + e | e • e | e*

Regular expression problems

Students are given with a description of a regular language L. We assume that the description of the language is given by a pair (P, N) of example strings.

Positive examples: P ⊆ Σ*
Negative examples: N ⊆ Σ*

Positive examples are correct strings that must be included in the language, and negative examples are wrong strings that must be excluded from the language.

A regular expression problem: (P, N)

## Technique

Basic search algorithm

Our algorithm checks all regular expressions in the order of their simplicity until it finds a solution.

Starting from the initial (i.e. simplest) regular expression, it iteratively constructs regular expressions. Because this approach results in a huge search space, we need to effectively prune out the search tree to find a solution in a reasonable time.

Pruning the search space

1. Prune when further search does not produce any results
2. Prune when the current candidate is subsumed by a simpler candidate

(Please check the paper for more details!)