Posted on May 6, 2008 at 3:24 A.M.

There are many times when the programming task at hand is to iterate over some semi-structured text, transform parts of that text in some way, and reintegrate those transformed parts back into the original text.

Typically using a regular expression with re.sub and a callback function, but sometimes you want a bit more control of the process (especially over those parts that dont match the regex). Usually my solution is to write a one-off function that does it, but today I had to write that function yet again and decided to generalize it and post it here.

To be completely honest, this post is more for my own archival purposes than for the internet as a whole, but if anyone else finds it useful, then I'm ecstatic.

def re_parts(regex_list, text):
    """
    An iterator that returns the entire text, but split by which regex it
    matched, or none at all.  If it did, the first value of the returned tuple
    is the index into the regex list, otherwise -1.

    >>> first_re = re.compile('asdf')
    >>> second_re = re.compile('an')
    >>> list(re_parts([first_re, second_re], 'This is an asdf test.'))
    [(-1, 'This is '), (1, 'an'), (-1, ' '), (0, 'asdf'), (-1, ' test.')]

    >>> list(re_parts([first_re, second_re], 'asdfasdfasdf'))
    [(0, 'asdf'), (0, 'asdf'), (0, 'asdf')]

    >>> list(re_parts([], 'This is an asdf test.'))
    [(-1, 'This is an asdf test.')]

    >>> third_re = re.compile('sdf')
    >>> list(re_parts([first_re, second_re, third_re], 'This is an asdf test.'))
    [(-1, 'This is '), (1, 'an'), (-1, ' '), (0, 'asdf'), (-1, ' test.')]
    """
    def match_compare(x, y):
        return x.start() - y.start()
    prev_end = 0
    iters = [r.finditer(text) for r in regex_list]
    matches = []
    while iters:
        if matches:
            match = matches.pop(0)
            (start, end) = match.span()
            if start > prev_end:
                yield (-1, text[prev_end:start])
                yield (regex_list.index(match.re), text[start:end])
            elif start == prev_end:
                yield (regex_list.index(match.re), text[start:end])
            prev_end = end
        else:
            matches = []
            for iterator in iters:
                try:
                    matches.append(iterator.next())
                except StopIteration:
                    iters.remove(iterator)
            matches = sorted(matches, match_compare)
    last_bit = text[prev_end:]
    if len(last_bit) > 0:
        yield (-1, last_bit)

Search

 

Recent Links

  • git-issues: A distributed issue tracker built-in to Git.
  • I predicted this back in March--can't believe a solution has surfaced so soon. It makes so much sense to build in an issue tracker to a revision control system. Since you're working with the code, might as well track the issues in the same system and take advantage of the extra metadata. This is really cool (and as a bonus, it's written in Python) so I hope to see it really grow and flourish!

  • How I Work Daily
  • Daily blog by Kevin Fricovsky. In addition to having some really great content, he has started to post audio interviews with people from the Django community. This is a site to keep an eye on in the coming days and months.

  • django-arcade
  • Demo site for django-arcade, an open source reusable Django app to add new flash games to any django-powered site. Looks very cool for easily creating game portals. It also comes from my future employer.

  • Facebook Chat and Scalability (with Erlang)
  • Eugene Letuchy talks about how they they took Facebook Chat from no users to 70 million users, with the help of Erlang.

  • Simon Willison: The Implications of OpenID
  • I somehow missed this presentation when it came out, but it's an absolutely fantastic overview and defense of OpenID by Simon Willison. If you are in any way interested in what OpenID is and what it can offer, you owe it to yourself to check out this presentation.

  • StupidXML
  • Probably the simplest XML library that I've seen for Python. Sometimes you just want to generate some stupid XML, and this is the perfect tool for the job.

  • See the rest of my links...

Pownce

Badges

  • django badge
  • apache badge
  • GeoURL
  • XFN Friendly
  • Valid HTML 4.01 Transitional