Synthesizing Regular Expressions from Examples

It was rather surprising that such a simple approach can go this far, and this case study should be shared in the community. – Reviewer 1


You can find relevant materials here:

  • Publication [pdf]
  • Demonstration [web]
  • Project Proposal [pdf]
  • Poster
  • Code [web]
AlphaRegex

AlphaRegex: Automatic Synthesizer for Regular Expressions from Examples


Background


For those who do not have background in computer science, I wrote brief descriptions for a few terms to help you understand the contents.


Regular expressions?

A regular expression is a way to express a text pattern.

*.txt

For instance, *.txt represents all text files. A wildcard notation * is used for any sequence of characters; hence, any files with the extension .txt regardless of their names are selected by the regular expression above. You may use it to find all text files in certain directories.

Due to the expressiveness of regular expressions, they are widely used for various text manipulation tasks such as text extraction, classification, etc.


Introductory automata assignments?

Regular expressions are taught in many computer science courses such as automata theory.

a, b, c, … → 0, 1

In practice, all available characters more or less are used to construct regular expressions. However, in automata classes, students who learn regular expressions for the first time usually only deal with the binary alphabet (i.e. 0 and 1) for simplicity.


Program synthesis?

Program synthesis is the task of automatically generating a program that meets user intent.

More details available here.


Motivation


This is a sample question in assignments for introductory automata courses.

Find a regular expression for the following language.
L = {w ∈ {0, 1}* | Strings have exactly one pair of consecutive 0s}

boy-asking-question-1

For students, they may not understand the meaning of the problem, do not know where to start, or give up after a few trials, not to mention they may come up with wrong answers.

Even for instructors, it takes a while to solve the problem. Moreover, they may wonder whether their regular expression is an optimal solution.


Use of examples

Case 1. When students do not understand the problem, instructors often use examples to clarify the meaning of the problem:

Student: I don’t understand what ‘one pair of consecutive 0s’ in the question means.

Instructor: It means ‘00‘ appears only one time in each string represented by the regular expression. For instance, correct strings would be ‘00‘, ‘1001′, ’010010′ and so on, whereas ’01’, ’11’, and ‘000’ would be wrong examples of the strings.

Case 2. Students often use examples and counterexamples to check whether their answers are correct.


Problem Statement


In lieu of having instructors demonstrating the meaning of problems to each student or wondering if their answers are optimal ones, let’s come up with:

A method for synthesizing regular expressions
from a set of simple, basic examples given by users.

00, 1001, 010010 (O)
01, 11, 000 (X)

(0?1)*00(10?)*

In a nutshell, while regular expressions are widely used for lots of text applications, it is still hard for first-timers to construct them. Our method automatically finds an optimal regular expression from a set of examples they provide.


Problem Formalization


Syntax

e → a ∈ Σ | ε | ∅ | e + e | e • e | e*


Regular expression problems

Students are given with a description of a regular language L. We assume that the description of the language is given by a pair (P, N) of example strings.

Positive examples: P ⊆ Σ*
Negative examples: N ⊆ Σ*

Positive examples are correct strings that must be included in the language, and negative examples are wrong strings that must be excluded from the language.

A regular expression problem: (P, N)


Technique


Basic search algorithm

Our algorithm checks all regular expressions in the order of their simplicity until it finds a solution.

screen-shot-2016-06-06-at-2-36-51-pm

Starting from the initial (i.e. simplest) regular expression, it iteratively constructs regular expressions. Because this approach results in a huge search space, we need to effectively prune out the search tree to find a solution in a reasonable time.


Pruning the search space

screen-shot-2016-06-06-at-2-40-40-pm

  1. Prune when further search does not produce any results
  2. Prune when the current candidate is subsumed by a simpler candidate

(Please check the paper for more details!)

Computer-Aided Education

In most cases, there are far fewer instructors than students. Working as a teaching assistant for Basics of Computer Systems at University of British Columbia, I remember that my colleagues and I spent great amount of time in grading assignments submitted by hundreds of students and had hard time giving each student qualitative feedback every week. I believe program synthesis can play a great role in that it can automatically generate and solve problems (even with difficulty customized to each student) as well as provide detailed, tailored feedback to students.

Herewith, I leave my super subjective notes and comments on each material to remember and easily refer to them at any point in the future. Hence, this page is constantly updated as I continue exploring the field of programming languages, especially program synthesis in the context of computer-aided education!

DISCLAIMER: This post does not claim to be comprehensive or exhaustive :)


General Guide to Computer-Aided Education


A great starting point to get a sense of what computer-aided education is.

Example-Based Learning in Computer-Aided STEM Education
CACM 2014, Sumit Gulwani

  • Types of Problems
    • Procedural Problems
      • Require students to follow a specific procedures
      • Students are expected to memorize and apply concepts
      • Examples: mathematical procedures (addition, greatest common divisor computation, etc.) and algorithmic procedures (breadth-first search, insertion sort, Dijkstra’s shortest-path algorithm, regular expression to automaton conversion, etc.)
    • Conceptual Problems
      • Require students creative thinking
      • Students are expected to give educated guesses
      • Examples: proof problems (natural deduction proofs, proofs of algebraic theorems, proofs of non-regularity of languages, etc.) and construction problems (geometric constructions, automata, algorithmic procedures, bit vector circuits, etc.)
  • Example-Based Learning
    • Problem Generation
    • Solution Generation
      • Automatically generating solutions when given a problem description in some subject domain
      • Usages: generate solutions for automatically generated problems, complete students’ incomplete solutions, generate hints
      • Procedural problems
        • Writing down and executing the corresponding procedure for a given problem
          • Programming by examples (PBE)
            • Mostly applied to end-user applications
            • Examples: spreadsheet tasks including string transformations and table layout transformations (it can be framed as the problem of learning mathematical procedures from example traces).
      • Conceptual problems
        • Searching over the underlying solution space
          • Perform reasoning over examples (as opposed to abstract symbolic reasoning)
          • Reduce solution space to solutions with small length
          • Examples
            • Geometry constructions
              • Intuition: regard geometry constructions as straight-line programs that manipulate geometry objects
              • Goal: synthesize a straight-line program that realizes the relational specification between inputs and outputs
              • Approach: generate random input-output examples and perform brute-force search to find a construction
            • Natural deduction proofs
    • Feedback Generation

More emphasis on computing education research (CER)

The Importance of Computing Education Research
CoRR 2016, Steve Cooper, Jeff Forbes, Armando Fox, Susanne E. Hambrusch, Andrew Ko, Beth Simon


Automated Feedback Generation

Automated Feedback Generation for Introductory Programming Assignments
PLDI 2013, Rishabh Singh, Sumit Gulwani, Armando Solar-Lezama

  • Contributions
    • Reduced the problem of finding minimal corrections in an incorrect program to the problem of synthesizing a correct problem from a sketch (check)
    • Defined an error model language to ease the burden of providing correction rules
  • Evaluation
    • Appropriate corrections: 64%
    • Average time: 10 seconds
    • Limitations
      • Unable to deal with completely incorrect answers, big conceptual errors, unimplemented features, timeout, structural requirements, etc.
      • Diversity of different answers are not taken care of (check)
      • Require instructors to provide corrections as well as the bound for input size, the number of loop unrolling, and recursion depth.
  • Approach
    • System
      • Prerequisites: reference implementation and correction rules
      • Input: students’ answers
      • Output: feedback with minimum corrections
    • Techniques
      • Program Rewriter
        • Input: a program and correction rules
        • Output: a program with set-expressions and set-statements (potential corrections applied with cost model)
      • Sketch Translator
        • Input: output from the previous step
        • Output: programs with holes
        • Used Sketch to leave parts of programs unspecified
      • Sketch Solver
        • Input: output from the previous step
        • Output: a program with minimum corrections (check)
        • Used an incremental solving algorithm CEGISMIN to solve minimization problem
      • Feedback Generator
        • Input: output from the previous step
        • Output: feedback in natural language
        • Associated each correction rule with a feedback message

Automated (Partial) Grading

Thanks to Loris D’Antoni , I was able to get more information about automated grading on basic problems of concepts in the theory of computation as well as Automata Tutor, the actual implementation of the work with web-based interfaces for end-users. After I received a reply from him this afternoon, I spent the rest of the day reading his follow-up papers, and I must say, I truly enjoyed the time.  I am really grateful to his work and appreciate his endeavor to make the service online. You can find the original paper on the work as well as follow-up papers listed below at his website.

Automated Grading of DFA constructions
IJCAI 2013, Rajeev Alur, Loris D’Antoni, Sumit Gulwani, Dileep Kini, Mahesh Viswanathan

How Can Automatic Feedback Help Students Construct Automata?
TOCHI 2015, L. D’Antoni, D. Kini, R. Alur, S. Gulwani, M. Viswanathan, B. Hartmann

Automata Tutor and What We Learned from Building an Online Teaching Tool
Bulletin of EATCS, L. D’Antoni, M. Weaver, A. Weinert, R. Alur

  • Functionalities
    • Automatic grading
      • DFA constructions
      • NFA constructions (naive)
      • NFA to DFA transformation (naive)
      • Regular expression constructions (naive)
      • Pumping lemma constructions (under development)
    • Personalized feedback
      • Binary feedback
      • Counterexample
      • Hints
    • Course management
      • Creating problems, problem sets, and courses
      • Posting assignments
      • Collecting grades
  • Case Study
    • Goal
      • Determine if feedback can improve learnability of DFA constructions
      • Identify strength and weakness of different feedback types
    • Setting
      • Participants: 377 students
      • Method: Automata Tutor
        • Part A: 4 problems with no restriction on the number of attempts
        • Part B: 1 problem with only one attempt
      • Feedback: randomly assigned among binary feedback, counterexamples, and hints
    • Results
      • Total number of attempts: 3,085
      • Average attempts: 3.3
      • Average points: 3.53 / 5
      • Overall evaluation
        • Binary feedback: least preferred
        • Counterexamples: most preferred
          • + Serve as a specific test case
        • Hints
          • + Provide a general idea of what is wrong
          • – Can be confusing, complex, or verbose
  • Future Work
    • Regular expressions
      • Define new metrics and feedback tailored for regular expressions
    • Automata meta-constructions
      • Able to use Coq to effectively grade complex assignments 
      • Example: Given two DFAs, define their intersection
    • Proof of non-regularity
    • Proof of DFA correctness
    • Context-free languages
    • Turing machines
    • MOOC deployment
  • Summary
    • Contributions
      • Classified common mistakes made when constructing DFAs and suggested three metrics for measuring how far off students’ solutions are from the correct answer
      • Extended Monadic-Second Order (MSO) logic to describe regular languages easily and implemented algorithms for transforming it to DFAs and visa versa
    • Evaluation
      • Total attempts: 800 attempts in total
      • Results: tool graded better than two instructors; no mistake for correctness, standard scheme of grading, and consistency
      • Limitations
        • Fail to grade some particular examples
        • Not scalable for DFAs with large alphabets or many states
    • Approach
      • System
        • Prerequisites: description of regular languages (i.e. problems)
        • Input: student attempts
        • Output: (partial) grade with feedback
      • Technique
        • Algorithm for grading DFA constructions
          • Problem syntactic distance
            • Technique: synthesize a logic description of a DFA, represent the MOSEL formula as an ordered tree, and compute tree edit distance with the solution’s ordered tree
            • Feedback: English description of a DFA using the logic description
          • Solution syntactic difference
            • Technique: use regular language density to compare the size of the difference between two languages when viewed as sets
            • Feedback: English description of misclassified strings or counterexamples
          • Problem semantic difference
            • Technique: use DFA edit distance which reflects the smallest number of edits (edit script) necessary to make the given DFA into a correct DFA
            • Feedback: English description of how to fix using the edit script

Conferences


Intelligent Tutoring Systems (ITS) 2016

The Intelligent Tutoring Systems (ITS) 2016 conference is the 13th in an ongoing series of top-flight international conferences on the use of advanced computer technologies and interdisciplinary research for enabling, supporting and enhancing human learning.

The theme of the conference is “Adaptive Learning in real world contexts“. It wishes to stress the need for devising learning systems that can adapt adequately to users, furnishing them the knowledge they are seeking in real world contexts, i.e. systems that are effectively usable in everyday learning situations such as courses in schools or training programs in companies, but also in informal situations such as Web or App provided help in using new technologies. The above theme encourages conference participants to think about this educational need in our increasingly complex everyday world.

Topics of interest to the conference include, but are not limited to:

  • Intelligent tutoring
  • Informal learning environments, learning as a side effect of interactions
  • Collaborative and group learning, communities of practice and social networks
  • Simulation-based learning and serious games
  • Dialogue and discourse during learning interactions
  • Co-adaptation between technologies and human learning
  • Ubiquitous and mobile learning environments
  • Empirical studies of learning with technologies, understanding human learning on the Web
  • Adaptive support for learning, models of learners, diagnosis and feedback
  • Modeling of motivation, metacognition, and affect aspects of learning
  • Recommender systems for learning
  • Virtual pedagogical agents and learning companions
  • Ontological modeling, semantic web technologies and standards for learning
  • Multi-agent and service oriented architectures for learning and tutoring environments
  • Educational exploitation of data mining and machine learning techniques
  • Instructional design principles or design patterns for educational environments
  • Authoring tools and development methodologies for advanced learning technologies
  • Domain-specific learning technologies, e.g. language, mathematics, reading, science, medicine, military, and industry.
  • Non conventional interactions between artificial intelligence and human learning
  • Privacy and security in e-learning environments

Innovation and Technology in Computer Science Education (ITiCSE 2016)

ITiCSE 2016, the 21th Annual Conference on Innovation and Technology in Computer Science Education, will normally address aspects of computing education, that is, the education of students who are studying computing.

Within the domain of computing and computing engineering education, papers might cover:

  • Specific educational subject matter, such as programming, database systems, or computer security
  • Specific groups of students, such as women, minorities, or school students
  • Broader topics, such as curriculum, groupwork, or class infrastructure

Computer-Supported Cooperative Work and Social Computing (CSCW 2017)

The 20th ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW 2017) is the premier venue for presenting research in the design and use of technologies that affect groups, organizations, communities, and networks. Bringing together top researchers and practitioners from academia and industry, CSCW explores the technical, social, material, and theoretical challenges of designing technology to support collaborative work and life activities.

World Conference on Computers in Education (WCCE 2017)

The World Conference on Computers in Education (WCCE 2017) is organized by the International Federation for Information Processing and hosted by the Irish Computer Society. The conference theme, “Tomorrow’s Learning: Involving Everyone”, reflects the long-standing commitment to both learning and technology that has concerned members of the Technical Committee 3 (TC3) and its working groups of the International Federation for Information Processing (IFIP).

The conference will focus on the latest uses of technologies, on computing and technology learning resources, highlighted educational practice, thought-leadership for the future and cutting edge educational research. This event will bring together professionals and novices, experienced teachers and learners, to dream about and discuss future learning environments.

Artificial Intelligence in Education (AIED 2017)

The 18th International Conference on Artificial Intelligence in Education (AIED 2017) is the next in a longstanding series of biennial international conferences for high quality research in intelligent systems and cognitive science for educational computing applications. Empirical and theoretical papers are solicited for submission.

Program Synthesis

My penchant for program synthesis stems largely from, I believe, its enormous applicability to other fields such as biology, database, data mining, human-computer interaction, machine learning, security, software engineering, and so much more.

Herewith, I leave my super subjective notes and comments on each material to remember and easily refer to them at any point in the future. Hence, this page is constantly updated as I continue exploring the field of programming languages, especially program synthesis!

DISCLAIMER: This post does not claim to be comprehensive or exhaustive :)


What is Program Synthesis?


Program Synthesis, Microsoft Research

Program synthesis is the task of automatically discovering an executable piece of code given user intent expressed using various forms of constraints such as input-output examples, demonstrations, natural language, etc. Program synthesis has direct applications for various classes of users in the technology pyramid: end users, students and teachers, software developers, and algorithm designers.

Program Synthesis, Wikipedia

Program synthesis is a special form of automatic programming that is most often paired with a technique for formal verification. The goal is to construct automatically a program that provably satisfies a given high-level specification. In contrast to other automatic programming techniques, the specifications are usually non-algorithmic statements of an appropriate logical calculus.


General Guide to Program Synthesis


A great starting point to get a sense of what program synthesis is.

Dimensions in Program Synthesis
PPDP 2010, Sumit Gulwani

  • User Intent
    • Logical Specifications
      • + Precise and succinct
      • – Difficult for end-users
    • Input-Output Examples
      • + Simple and less chance of error
        (with interaction, possible to achieve full functional specification)
      • – Hard to decide selection criterion and the number of examples
    • Traces
      • + More informational and sometimes easier than input-output examples
      • + Easier for synthesis
      • – Users need to know how to perform the task
    • Natural Language
      • + Best for end-users
      • – Ambiguous
        (with interaction, possible to resolve ambiguity)
  • Search Space
    • Logic
      • + Precise and succinct
      • Good for program synthesis
    • Grammars
      • + Has many applications
        (e.g. regular expressions, DFAs, NFAs, context-free grammars, etc.)
    • Programs
      • – Large search space
      • Restrict the choices of
        • Operators
        • Control structure
  • Search Technique
    • Brute-force Search
      • + Simplest
      • – Too expensive, requiring optimizations and pruning search space
      • Applications: functional program (a system that searches for desired functional programs by generating type-correct programs and checking them whether they satisfy given specifications)
    • Version Space Algebra
      • + Relatively simple, efficient
      • Lau et.al. extended Mitchell’s version space to version space algebra
        • Basic idea of version space: to maintain a set of all possible answers and iteratively refine the set as more data becomes available
        • Basic idea of version space algebra: to build up a complex version space by composing simpler version spaces instead of defining a single, large version space
      • Applications: repetitive robot programs, shell scripts, text-editing programs, imperative Python programs, etc.
    • Machine Learning Based Techniques
      • Probabilistic Inference
        • Model a program as a graph and use belief propagation to infer states and/or instructions
        • Applications: program verification (inferring abstract states given the program instructions), program synthesis (inferring both concrete states and instructions that satisfy a given set of input-output examples), etc.
      • Genetic Programming
        • + Can be used when conventional techniques fail, able to provide approximate solutions
        • An evolutionary algorithm-based methodology to find computer programs tailored to an user-defined task
        • Applications: mutual exclusion algorithms, fixing bugs in imperative programs, etc.
    • Logical Reasoning Based Techniques
      • Constraint Generation: generating a logical constraint (i.e. synthesis constraint)
        • Invariant-based
        • Path-based
        • Input-based
      • Constraint Solving: reducing the generated constraint to a corresponding SAT/SMT constraint to let off-the-shelf constraint solvers solve the problem
        • Reducing Second-order Unknowns to First-order Unknowns
        • Universal Quantifier Elimination

More detailed introduction to inductive programming.

Inductive Programming Meets the Real World
CACM 2015, Sumit Gulwani, Jose Hernandez-Orallo, Emanuel Kitzelmann, Stephen Muggleton, Ute Schmid, Ben Zorn

Programming by Examples (and its applications in data wrangling)
Marktoberdorf Lecture Notes, Sumit Gulwani


A good summary of PBE-based technologies (in MS software) and industrial point of view.

Program Synthesis in the Industrial World: Inductive, Incremental, Interactive
SYNT 2016, Oleksandr Polozov, Sumit Gulwani, and the rest of the PROSE team

  • Programming by example (PBE)
    • A subfield of program synthesis whose specification on a target program is given as input-output examples (i.e. constraints on the output)
    • It has been existed since 1980s; however, it is only the last decade when mass-market applications appeared
  • Mass-market applications
    • FlashFill (2013): a system for automatic synthesis of string transformation scripts in spreadsheets
    • FlashExtract (2014): a system for automated data extraction from text files
    • FlashMata (2015): a framework for automatic generation of domain-specific inductive synthesizers
  • Challenges in deploying program synthesis for industrial applications
    • Ambiguity resolution
      • Problem: ambiguous input-output examples
        (users easily lose their trust after a few attempts)
      • Solution: ranking
        • Specified manually by a DSL designer
        • Learned from a dataset
    • Incremental synthesis
      • Problem: dealing with new examples
      • Solution: incremental synthesis
        • Given a set of solutions from the past user interactions and the new constraints for the same learning task, filter the set down to only those programs that satisfy the new constraints
        • Consider combining background processing, “fast-track” tactics, and novel data structures for storing and processing DSLs
    • Interaction
      • + Another option for ambiguity resolution
      • + One way of establishing trust in systems and creating predictive experiences
      • – Hard to define and implement problems
    • Performance-minded engineering
      • – Difficult to be responsive due to a huge search space

Some publications of technologies incorporated in MS software

FlashMeta: A Framework for Inductive Program Synthesis
OOPSLA 2015, Alex Polozov, Sumit Gulwani

FlashExtract: A Framework for Data Extraction by Examples
PLDI 2014, Vu Le, Sumit Gulwani

  • Contributions
    • Suggested a general synthesis framework for DSLs to extract relevant data from semi-structures documents using examples
    • Presented an interaction model for end-users to give examples
    • Designed an inductive synthesis algorithm
    • Allowed examples to be in any DSL for data extraction
  • Evaluation
    • Domains: text files, webpages, and spreadsheets
    • Benchmarks: 75 documents
    • Average number of examples: 2.36
    • Average time taken per field: 0.84 seconds

Automating String Processing in Spreadsheets using Input-Output Examples
POPL 2011, Sumit Gulwani


Manipulating recursive functions, data structures, etc.

Synthesizing Transformations on Hierarchically Structured Data
PLDI 2016, Navid Yaghmazadeh, Chris Klinger, Isil Dillig, Swarat Chaudhuri

Synthesizing Data Structure Transformations from Input-Output Examples
PLDI 2015, John Feser, Swarat Chaudhuri, Isil Dillig

  • Summary
    • User intent: input-output examples
    • Search space: functional programs
      (programs in a λ-calculus with algebraic types and recursion)
    • Search technique: brute-force search with optimizations
  • Contributions
    • Focused on recursive data structures to effectively prune out a search space
  • Evaluation
    • Problem set: 40 data structure manipulation tasks
    • Runtime: 0.43 seconds (median)
    • Number of examples: 5 or fewer (for 75% of the benchmarks)
  • Approach
    • System
      • Input: a set of input-output examples
      • Output: a minimal-cost closed program that satisfies the examples
    • Techniques
      • Inductive generalization
      • Deductive reasoning
      • Best-first enumerative search

Program Synthesis
Empowering Other Fields


Biology

Saurabh Srivastava
Founder and CEO of 20n


Database

Alvin Cheung
Assistant Professor in the Department of Computer Science & Engineering at the University of Washington


Data Mining

Using Program Synthesis for Social Recommendations
CIKM 2012, Alvin Cheung, Armando Solar-Lezama, Samuel Madden


Machine Learning

Unsupervised Learning by Program Synthesis
NIPS 2015, Kevin Ellis, Armando Solar-Lezama, Joshua B. Tenenbaum

sk_p: a neural program corrector for MOOCs
CoRR 2016, Yewen Pu, Karthik Narasimhan, Armando Solar-Lezama, Regina Barzilay

Latte: A Language, Compiler, and Runtime for Elegant and Efficient Deep Neural Networks
PLDI 2016, Leonard Truong, Rajkishore Barik, Ehsan Totoni, Hai Liu, Chick Markley, Armando Fox, Tatiana Shpeisman


Software Engineering

Synthesizing Framework Models for Symbolic Execution
ICSE 2016, Jinseong Jeon, Xiaokang Qiu, Jonathan Fetter-Degges, Jeffrey S. Foster, Armando Solar-Lezama