RegEx in C

Premise

Many of us have been there: input is coming into a program, and its exact form is nebulous. The input specification may not be precise enough, or optional spaces may mean the string tokenizer cannot be used in every case. What to do?

Anyone who also has a modicum of shell experience may find themselves thinking “I could totally capture that into a variable via a command expansion to grep”, so we should take care to identify the situation:

  • we'd like the functionality of grep, in a C program
  • more specifically, we'd like to use a regex to parse and grab things

Gee… grep must be written in C, so perhaps there are some standard library functions available.

regex.h

Sure enough, there ARE regular expression functions available to us, and they can enable all sorts of great things that would otherwise make for very precarious processing.

Just as is the case with shell scripting: we shouldn't be the computer, we should let the computer do what it is good at.

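There are four regex functions available to us in the C library (declared in regex.h, which comes from POSIX rather than ISO C):

  • regcomp(): compile a regex string into a regex_t
  • regexec(): run a compiled regex against a string
  • regerror(): turn an error code into a readable message
  • regfree(): release the memory held by a compiled regex

These allow us to throw a regex (stored in a string) at a string of data to parse and, by using regex groups, even pick out the individual pieces that matched.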

Sort of like the string tokenizer amped up.

Example

As an example, let's say we have input where the fields are comma delimited. Each field may contain a number (whole or decimal), optionally followed by a text suffix.

Additionally, there could be any number of spaces padded before/after/between these two values.

data

For example, this:

4.86 foo , 5.12bar,2.54 baz,,  16 ,  .416p

And let's say we'd like to isolate the number (and treat it as a float), and isolate the text, each in their own variables.

If that seems like it could be an undertaking doing things as usual, you would not be mistaken.

regex

So now, we will see how using a regex, in combination with the regex functions, can make our lives a lot easier.

First up, the regex (which will describe and group the fields of data):

([0-9]*\.?[0-9]*) *([A-Za-z]*)( *, *)*

Picking it apart:

  • ([0-9]*\.?[0-9]*)
    • for one, this is a group (denoted by the parentheses)
    • note that we have a character class describing any decimal digit (0 or more of them), once before and once after the decimal point
    • and we have 0 or 1 periods '\.' to denote a decimal place; the backslash matters, because a bare '.' would match any character at all
    • this will match whole numbers and decimaled numbers alike
  • ' *' (a space followed by *)
    • match zero or more spaces (not in a group)
  • ([A-Za-z]*)
    • another group
    • match 0 or more letters of the alphabet (lower or uppercase)
  • ( *, *)*
    • a group, but used only to bundle the separator into a single item
    • which we are matching 0 or more times
    • each time: 0 or more spaces, followed by a comma, followed by 0 or more spaces

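This pattern (which honestly could probably be further tuned) seems to adequately describe our input data, in any of its likely forms. One thing to keep in mind: every piece of it is optional, so it will also happily match an empty string.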

code

Following is a sample C program I wrote to explore the regex functionality (which I am putting to use in another endeavor of mine). It matches the different groups, and uniquely prints them out (and if you can print them out, you can otherwise manipulate them, or place them into variables).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <regex.h>
 
typedef unsigned int ui;
 
int main ()
{
    char       *source        = "4.86 foo , 5.12bar,2.54 baz,,  16 ,  .416p";
    char       *regex         = "([0-9]*\\.?[0-9]*) *([A-Za-z]*)( *, *)*";
    char       *rpos          = NULL;
    int         status        = 0;
    size_t      numgroups     = 3;
 
    char       *loc           = source;
    ui offset                 = 0;
 
    regex_t     regexdata;
    regmatch_t  group[numgroups];
 
    status                    = regcomp (&regexdata, regex, REG_EXTENDED);
    if (status               != 0)
    {
        fprintf (stderr, "Could not compile regular expression.\n");
        return (1);
    }
 
    status                    = regexec(&regexdata, loc, numgroups, group, 0);
    while (status            != REG_NOMATCH)
    {
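        /* stop once the end of the input has been reached */
        if (*loc             == '\0')
            break;

        /* an empty match consumes nothing; stop rather than loop forever */
        offset                = group[0].rm_eo;
        if (offset           == 0)
            break;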
 
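        /* work on a private copy so each group can be NUL-terminated in place */
        rpos                  = (char *) malloc (sizeof (char) * (strlen (loc) + 1));
        if (rpos             == NULL)
        {
            fprintf (stderr, "Could not allocate memory.\n");
            regfree (&regexdata);
            return (2);
        }
        strcpy (rpos, loc);
        rpos[group[1].rm_eo]  = 0;
        fprintf (stdout, "number: %5.2f, ", atof(rpos + group[1].rm_so));
        strcpy (rpos, loc);
        rpos[group[2].rm_eo]  = 0;
        fprintf (stdout, "unit:   %3s\n", rpos  + group[2].rm_so);
        free (rpos);
        loc                   = loc + offset;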
        status                = regexec (&regexdata, loc, numgroups, group, 0);
    }
 
    regfree (&regexdata);
 
    return (0);
}

compile

Compiling is straightforward:

$ gcc -o regex regex.c

executing

And we have the results, nicely output to STDOUT:

$ ./regex
number:  4.86, unit:   foo
number:  5.12, unit:   bar
number:  2.54, unit:   baz
number: 16.00, unit:      
number:  0.42, unit:     p

Isn't that just a beautiful thing…

Resources

The following sites were consulted while exploring this endeavor:

haxx/examples/regex_clang.txt · Last modified: 2018/03/04 14:36 by wedge