SteadBytes |TPP Topic 21: Text Manipulation

See the first post in The Pragmatic Programmer 20th Anniversary Edition series for an introduction.

Exercise 11

You’re rewriting an application that used to use YAML as a configuration language. Your company has now standardized on JSON, so you have a bunch of .yaml files that need to be turned into .json. Write a script that takes a directory and converts each .yaml file into a corresponding .json file (so database.yaml becomes database.json, and the contents are valid JSON).

Conversion between YAML and JSON is straightforward in Python using PyYAML and the standard library json module. Both convert basic Python objects (dict, list, str, int, etc.) to and from their respective formats, and they provide almost identical APIs.

Algorithm outline:

1. Find all YAML files in a given directory
2. Load a YAML file from disk into a Python object using PyYAML
3. Serialize the Python object to JSON
4. Write the serialized JSON to a new .json file
5. Delete the YAML file
6. Repeat steps 2-5 for all YAML files from step 1

import json
from pathlib import Path

import yaml

def main(d: Path):
    # make sure we have a valid directory to search in
    assert d.exists()
    assert d.is_dir()

    for yaml_f in d.glob("*.yaml"):
        # load original YAML data
        with yaml_f.open() as f:
            data = yaml.safe_load(f)
        # write data to new json file
        json_f = d / f"{yaml_f.stem}.json"
        with json_f.open(mode="w") as f:
            json.dump(data, f, indent=4)
        # delete original YAML file
        yaml_f.unlink()
        print(f"{yaml_f} -> {json_f}")

See yaml_to_json.py on GitHub for the full CLI script.
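
The loop above only picks up the .yaml spelling of the extension. As a minimal sketch using only pathlib, the source-to-target mapping could be computed up front, which also allows a dry run before any file is deleted. Note that plan_conversions is a hypothetical helper (not part of yaml_to_json.py), and treating .yml files as in scope is an assumption:

```python
from pathlib import Path
from typing import List, Tuple


def plan_conversions(d: Path) -> List[Tuple[Path, Path]]:
    """Pair each YAML file in `d` with the JSON path it would be written to."""
    # hypothetical helper: accepts both .yaml and .yml (an assumption)
    sources = sorted(
        p for p in d.iterdir() if p.is_file() and p.suffix in {".yaml", ".yml"}
    )
    # with_suffix swaps only the final extension: database.yaml -> database.json
    return [(p, p.with_suffix(".json")) for p in sources]
```

Printing the returned pairs before converting gives a cheap preview of what the script is about to do.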

Exercise 12

Your team initially chose to use camelCase names for variables, but then changed their collective mind and switched to snake_case. Write a script that scans all the source files for camelCase names and reports on them.

In a real world scenario where I just need to accomplish the task, I would definitely solve this with egrep:

$ egrep -rn --color "\b[a-z]+(([0-9])|([A-Z0-9][a-z0-9]+))+([A-Z])?" /path/to/source/directory

However, for educational purposes, I have implemented this functionality in a Python script along with unit tests. See camel_finder.py on GitHub for the full CLI script and tests.

The task breaks down into the following high level steps:

1. Iterate over a sequence of files (provided as arguments to the script)
2. Iterate over the lines of each file
3. Identify any camelCase strings in each line
4. Output a report of each identified camelCase string

Steps 1 and 2 are trivial in Python:

import sys
from pathlib import Path
from typing import List


def main(files: List[Path]):
    for file in files:
        with file.open() as f:
            for line in f:
                # do steps 3 and 4
                pass


if __name__ == "__main__":
    if len(sys.argv) == 1:
        # sys.exit with a string prints it to stderr and exits with status 1
        sys.exit("Usage: ./camel_finder.py FILES...")

    main([Path(f) for f in sys.argv[1:]])

Step 3 can be achieved with a regular expression, as shown in the egrep example. For this exercise I am using the Google Java style guide definition of camelCase, specifically lower camelCase (not PascalCase).

import re


CAMEL_RE = re.compile(
    (
        # starts at a word boundary with one or more lower case characters
        r"\b[a-z]+"
        # followed by one or more 'humps', each either a single digit
        # OR an upper case character/digit followed by lower case characters or digits
        r"((\d)|([A-Z0-9][a-z0-9]+))+"
        # final character *may* be upper case
        r"([A-Z])?"
    )
)

Demo of this regex: https://regex101.com/r/paY5R6/1/
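
To get a feel for what the pattern accepts, here is a small sketch that runs it over a few identifier styles. camel_words is a hypothetical helper introduced only for this illustration; the pattern is the same one as CAMEL_RE above, written on a single line.

```python
import re
from typing import List

# same pattern as CAMEL_RE above, written on one line
CAMEL_RE = re.compile(r"\b[a-z]+((\d)|([A-Z0-9][a-z0-9]+))+([A-Z])?")


def camel_words(text: str) -> List[str]:
    """Return every substring of `text` matched by CAMEL_RE (hypothetical helper)."""
    return [m.group(0) for m in CAMEL_RE.finditer(text)]


print(camel_words("camelCase PascalCase snake_case value2 plain"))
# ['camelCase', 'value2']
```

Note that a lower case name with a trailing digit, such as value2, also counts as a match under this pattern.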

Using CAMEL_RE, the positions of camelCase substrings can be extracted from a string. A line and the positions of any camelCase substrings within it are represented by two NamedTuple objects: Match and MatchGroup. These are inspired by the standard library re.Match objects, which supply their data. This separates the concept of “a line of text with camelCase substrings” from “a regular expression that matches camelCase substrings”, so the underlying mechanism for finding camelCase substrings can be changed more easily if necessary.

from typing import Generator, Iterable, List, NamedTuple


class MatchGroup(NamedTuple):
    start: int
    end: int


class Match(NamedTuple):
    lineno: int
    groups: List[MatchGroup]
    line: str


def find_camel(lines: Iterable[str]) -> Generator[Match, None, None]:
    for i, line in enumerate(lines):
        groups = [MatchGroup(*m.span()) for m in CAMEL_RE.finditer(line)]
        if groups:
            yield Match(i, groups, line)

Example usage:

>>> lines = ["a camelCase line", "thisIs a line with two camelCase words", "a line without camel case"]
>>> for match in find_camel(lines):
...     print(match)
Match(lineno=0, groups=[MatchGroup(start=2, end=11)], line='a camelCase line')
Match(lineno=1, groups=[MatchGroup(start=0, end=6), MatchGroup(start=23, end=32)], line='thisIs a line with two camelCase words')

Step 4 (without installing third party packages) involves using ANSI colour escape codes. To emulate the --color option of grep, each matched line should be output with the following format:

<optional purple filename>:<green line number>: <white text> <red camelCase match> <white text>...

The ‘pretty match’ string is built up by iterating through each MatchGroup of a given Match. For each one, the slice of the original line containing the non-matched text before the MatchGroup is extracted, followed by the slice at the position of the MatchGroup itself, which is prefixed with the red escape code: \033[31m.

Note that after each use of a colour escape code, the colour is reset using \033[m.

def pretty_match(m: Match, filename: str = None) -> str:
    """
    Build a 'pretty' string representation of `m`, with coloured text and line
    numbers; optionally prefixed with `filename`.

    Colours:
        - Line numbers = green
        - Matches = red
        - Filenames = purple
        - Non-matching text = white
    """
    pretty_name = f"\033[35m{filename}\033[m:" if filename else ""
    l = []
    prev = 0
    for g in m.groups:
        # text up until match in white, match in red
        l.append((f"{m.line[prev:g.start]}"f"\033[31m{m.line[g.start:g.end]}\033[m"))
        prev = g.end
    l.append(m.line[prev:])
    return "".join([f"{pretty_name}\033[32m{m.lineno}\033[m:"] + l)
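
As a sketch of what this produces, the snippet below repeats the definitions so that it runs on its own, and prints the repr of the result so the escape codes stay visible instead of being interpreted by the terminal. The sample Match and the filename app.py are made up for illustration.

```python
from typing import List, NamedTuple, Optional


class MatchGroup(NamedTuple):
    start: int
    end: int


class Match(NamedTuple):
    lineno: int
    groups: List[MatchGroup]
    line: str


def pretty_match(m: Match, filename: Optional[str] = None) -> str:
    # same logic as pretty_match above, repeated to keep the example runnable
    pretty_name = f"\033[35m{filename}\033[m:" if filename else ""
    parts = []
    prev = 0
    for g in m.groups:
        # unmatched text as-is, then the match wrapped in red + reset
        parts.append(f"{m.line[prev:g.start]}\033[31m{m.line[g.start:g.end]}\033[m")
        prev = g.end
    parts.append(m.line[prev:])
    return "".join([f"{pretty_name}\033[32m{m.lineno}\033[m:"] + parts)


sample = Match(0, [MatchGroup(2, 11)], "a camelCase line")  # made-up input
print(repr(pretty_match(sample, filename="app.py")))
# '\x1b[35mapp.py\x1b[m:\x1b[32m0\x1b[m:a \x1b[31mcamelCase\x1b[m line'
```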

Bringing it all together in the original main function:

def main(files: List[Path]):
    """
    Print a report on the locations of all camelCase strings in `files`. See
    `pretty_match` for output format.
    """
    show_filenames = len(files) > 1
    for file in files:
        with file.open() as f:
            for m in find_camel(f):
                print(pretty_match(m, filename=file if show_filenames else None))

Exercise 13

Following on from the previous exercise, add the ability to change those variable names automatically in one or more files. Remember to keep a backup of the originals in case something goes horribly, horribly wrong.

Again, see camel_finder.py on GitHub for the full CLI script and tests.

Once a camelCase substring has been found, converting it to snake_case requires two steps:

1. Insert an _ character between each camelCase ‘hump’
   - camelCase -> camel_Case
   - camelCamelCase -> camel_Camel_Case
2. Convert to lower case

Regex substitution using a capture group can be used for step 1 by matching and capturing a ‘hump’ character:

camel(C)ase
camel(C)amel(C)ase

Then inserting an _ character before the ‘hump’ character by referencing the capture group in the substitution: '_\1'.
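
As a simplified illustration of the two steps, the sketch below uses a bare ([A-Z]) capture group. This is deliberately cruder than a production pattern (it would, for instance, also prefix a leading capital with an underscore), and naive_snake is a hypothetical name used only here.

```python
import re


def naive_snake(word: str) -> str:
    """Hypothetical, simplified converter: underscore before every capital, then lower."""
    # step 1: \1 refers back to the captured 'hump' character
    with_underscores = re.sub(r"([A-Z])", r"_\1", word)
    # step 2: convert to lower case
    return with_underscores.lower()


print(naive_snake("camelCase"))       # camel_case
print(naive_snake("camelCamelCase"))  # camel_camel_case
```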


CONVERT_CAMEL_RE = re.compile(
    (
        # match a normal 'hump' i.e. camel(C)ase or camel1(C)ase
        r"((?<=[a-z0-9])[A-Z]"
        # match mid-string uppercase humps, ignoring existing underscores
        # i.e. CAMEL(C)ase or HTTP(E)rror
        r"|(?!^)(?<!_)[A-Z](?=[a-z]))"
    )
)


def convert_camel_word(w: str) -> str:
    return CONVERT_CAMEL_RE.sub(r"_\1", w).lower()

Demo of this regex: https://regex101.com/r/3AjsDU/1

Example usage:

>>> convert_camel_word("camelCase")
'camel_case'

>>> convert_camel_word("camelCamelCase")
'camel_camel_case'

>>> convert_camel_word("snakey_camelCase")
'snakey_camel_case'

convert_camel_word is designed to convert a single camelCase word, not strings of arbitrary text containing camelCase words. For example:

>>> convert_camel_word("System.out.println(Arrays.toString(myArray));")
'system.out.println(_arrays.to_string(my_array));'

Instead of:

'System.out.println(Arrays.to_string(my_array));'

To convert strings of arbitrary text, the original camelCase matching from exercise 12 is used to first find the locations of individual camelCase strings. Once found, they can be passed to convert_camel_word and inserted into the correct position of the original text.


# factored out of existing find_camel function
def find_match_groups(s: str) -> list:
    return [MatchGroup(*m.span()) for m in CAMEL_RE.finditer(s)]


def convert_camel_line(l: str) -> str:
    # find individual camelCase strings within `l`, working right to left so
    # that earlier offsets stay valid as replacements lengthen the line
    for g in reversed(find_match_groups(l)):
        # replace with snake_case equivalent
        l = l[0 : g.start] + convert_camel_word(l[g.start : g.end]) + l[g.end :]
    return l


def convert_camel(lines: Iterable[str]) -> Generator[str, None, None]:
    return (convert_camel_line(l) for l in lines)
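
An alternative sketch that avoids offset bookkeeping altogether is to pass a function as the replacement argument of re.sub, which Python calls once per match. This is not how camel_finder.py does it; convert_camel_line_sub is a hypothetical name, and the two patterns are repeated from above (on single lines) so the block runs on its own.

```python
import re

CAMEL_RE = re.compile(r"\b[a-z]+((\d)|([A-Z0-9][a-z0-9]+))+([A-Z])?")
CONVERT_CAMEL_RE = re.compile(r"((?<=[a-z0-9])[A-Z]|(?!^)(?<!_)[A-Z](?=[a-z]))")


def convert_camel_word(w: str) -> str:
    return CONVERT_CAMEL_RE.sub(r"_\1", w).lower()


def convert_camel_line_sub(line: str) -> str:
    """Hypothetical variant: let re.sub splice each converted word into place."""
    # the callable receives each re.Match; group(0) is the camelCase word
    return CAMEL_RE.sub(lambda m: convert_camel_word(m.group(0)), line)


print(convert_camel_line_sub("System.out.println(Arrays.toString(myArray));"))
# System.out.println(Arrays.to_string(my_array));
```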

Applying the conversion to a file involves backing up the original file, converting each line in turn and writing to a new file:

def transform_camel(file: Path):
    """
    Transform all occurrences of camelCase strings in `file` to snake_case. The
    original file is renamed with a ".backup" extension to prevent data loss.
    """
    original = Path(str(file) + ".backup")
    file.rename(original)
    with original.open() as source:
        with file.open("w") as dest:
            dest.writelines(convert_camel(source))
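
A slightly more defensive variant, sketched here as an assumption rather than what camel_finder.py does, copies the original to the backup path first and only then rewrites the file in place. transform_with_backup and its convert_line parameter are hypothetical names; any str -> str function can be passed in.

```python
import shutil
from pathlib import Path
from typing import Callable


def transform_with_backup(file: Path, convert_line: Callable[[str], str]) -> Path:
    """Rewrite `file` line by line, keeping an untouched copy at `<file>.backup`."""
    backup = Path(str(file) + ".backup")
    # copy2 also preserves metadata such as modification time where possible
    shutil.copy2(file, backup)
    # read from the backup copy while overwriting the original path
    with backup.open() as source, file.open("w") as dest:
        dest.writelines(convert_line(line) for line in source)
    return backup
```

Because the original path is truncated and rewritten rather than recreated, the converted file keeps its existing permissions.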

To wrap it up, the existing main entry point delegates according to whether reporting or transformation is required and an additional --convert command line argument is added. Command line argument parsing is now performed by argparse instead of manually inspecting sys.argv:

import argparse


def main(files: List[Path], convert=False):
    for f in files:
        if convert:
            transform_camel(f)
        else:
            report_camel(f, show_filenames=len(files) > 1)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=sys.modules[__name__].__doc__)
    parser.add_argument("files", nargs="+", help="source files to scan for camelCase")
    parser.add_argument(
        "--convert",
        action="store_true",
        dest="convert",
        help="perform camelCase to snake_case conversion",
    )
    args = parser.parse_args()
    main([Path(f) for f in args.files], convert=args.convert)

CLI help text:

$ ./camel_finder.py -h
usage: camel_finder.py [-h] [--convert] files [files ...]

Scan source files for camelCase strings, reporting (grep style) on locations
or converting to snake_case. During conversion, original files are renamed with
a ".backup" extension. Yes I'm aware this can probably be achieved with a bash
one-liner.

positional arguments:
  files       source files to scan for camelCase

optional arguments:
  -h, --help  show this help message and exit
  --convert   perform camelCase to snake_case conversion
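
The argument handling can also be checked without a shell, since parse_args accepts an explicit list of strings. A small sketch, where build_parser is a hypothetical wrapper around the parser set-up shown above and the file names are made up:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # hypothetical wrapper; the arguments mirror the ones defined above
    parser = argparse.ArgumentParser()
    parser.add_argument("files", nargs="+", help="source files to scan for camelCase")
    parser.add_argument(
        "--convert",
        action="store_true",
        help="perform camelCase to snake_case conversion",
    )
    return parser


args = build_parser().parse_args(["--convert", "a.py", "b.py"])
print(args.convert, args.files)  # True ['a.py', 'b.py']
```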
