See the first post in The Pragmatic Programmer 20th Anniversary Edition series for an introduction.
Exercise 11
You’re rewriting an application that used to use YAML as a configuration language. Your company has now standardized on JSON, so you have a bunch of .yaml files that need to be turned into .json. Write a script that takes a directory and converts each .yaml file into a corresponding .json file (so database.yaml becomes database.json, and the contents are valid JSON).
Conversion between YAML and JSON can be done easily in Python using PyYAML and the standard library json module. PyYAML and json both convert basic Python objects (dict, list, str, int, etc.) to and from YAML and JSON respectively, and they provide almost identical APIs.
Algorithm outline:
- Find all YAML files in a given directory
- Load a YAML file from disk into a Python object using PyYAML
- Serialize the Python object to JSON
- Write the serialized JSON to a new .json file
- Delete the YAML file
- Repeat steps 2-5 for all YAML files from step 1
import json
from pathlib import Path

import yaml


def main(d: Path):
    # make sure we have a valid directory to search in
    assert d.exists()
    assert d.is_dir()
    for yaml_f in d.glob("*.yaml"):
        # load original YAML data
        with yaml_f.open() as f:
            data = yaml.safe_load(f)
        # write data to new json file
        json_f = d / f"{yaml_f.stem}.json"
        with json_f.open(mode="w") as f:
            json.dump(data, f, indent=4)
        # delete original YAML file
        yaml_f.unlink()
        print(f"{yaml_f} -> {json_f}")
See yaml_to_json.py on GitHub for the full CLI script.
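One caveat with the conversion loop above: PyYAML's safe_load turns unquoted ISO dates into datetime.date objects, and json.dump rejects those with a TypeError. The following is a minimal sketch of one possible workaround, using the default hook of the json module to fall back to str. The sample dict is hypothetical and only stands in for what safe_load might return; it is not taken from the original script.

```python
import datetime
import json

# Hypothetical stand-in for the result of yaml.safe_load on a file
# containing the line `released: 2020-01-01`.
data = {"name": "db", "released": datetime.date(2020, 1, 1)}

try:
    json.dumps(data)
    plain_ok = True
except TypeError:
    # date objects are not JSON serializable by default
    plain_ok = False

# `default` is called for every object json cannot serialize on its own;
# here it simply converts such objects to their string form.
converted = json.dumps(data, default=str)
print(plain_ok)
print(converted)
```

Whether a plain string is the right JSON representation for a date depends on whatever reads the file afterwards, so treat default=str as a starting point rather than a rule.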
Exercise 12
Your team initially chose to use camelCase names for variables, but then changed their collective mind and switched to snake_case. Write a script that scans all the source files for camelCase names and reports on them.
In a real-world scenario where I just need to accomplish the task, I would definitely solve this with egrep:
$ egrep -rn --color "\b[a-z]+((\d)|([A-Z0-9][a-z0-9]+))+([A-Z])?" /path/to/source/directory
However, for educational purposes, I have implemented this functionality in a Python script along with unit tests. See camel_finder.py on GitHub for the full CLI script and tests.
The task breaks down into the following high level steps:
- Iterate over a sequence of files (provided as arguments to the script)
- Iterate over the lines of each file
- Identify any camelCase strings in each line
- Output a report of each identified camelCase string
Steps 1 and 2 are trivial in Python:
import sys
from pathlib import Path
from typing import List


def main(files: List[Path]):
    for file in files:
        with file.open() as f:
            for line in f:
                pass  # do steps 3 and 4


if __name__ == "__main__":
    if len(sys.argv) == 1:
        # err_exit is a helper that is not shown in this snippet
        err_exit("Usage: ./camel_finder.py FILES...")
    main([Path(f) for f in sys.argv[1:]])
Step 3 can be achieved with a regular expression, as shown in the egrep example. For this exercise I am using the Google Java style guide definition of camelCase, specifically lower camelCase and not PascalCase.
import re

CAMEL_RE = re.compile(
    (
        # starts at a word boundary with one or more lower case characters
        r"\b[a-z]+"
        # followed by a single digit
        # OR upper case character/number followed by lower case characters or number
        r"((\d)|([A-Z0-9][a-z0-9]+))"
        # (the group above may repeat) final character *may* be upper case
        r"+([A-Z])?"
    )
)
- Demo of this regex: https://regex101.com/r/paY5R6/1/
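To get a feel for what the pattern accepts and rejects, here is a small sketch that runs it over a few sample words. The helper name camel_words and the sample strings are illustrative assumptions, not part of the original script; the pattern itself is the one shown above, written as a single string.

```python
import re

# Same pattern as CAMEL_RE above, joined into one raw string
CAMEL_RE = re.compile(r"\b[a-z]+((\d)|([A-Z0-9][a-z0-9]+))+([A-Z])?")


def camel_words(text: str) -> list:
    # finditer + group(0) yields whole matches; findall would instead
    # return tuples of the capture groups because the pattern has groups
    return [m.group(0) for m in CAMEL_RE.finditer(text)]


print(camel_words("camelCase PascalCase snake_case parseHttpResponse"))
# A lower case word followed by a single digit also matches
print(camel_words("item2"))
```

PascalCase is skipped because the match must begin with a lower case letter at a word boundary, and snake_case is skipped because an underscore counts as a word character, so no boundary exists after it.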
Using CAMEL_RE, the positions of camelCase substrings can be extracted from a string. A line and the positions of the camelCase substrings within it are represented by two NamedTuple classes: MatchGroup and Match. They are modelled on the standard library re.Match objects, which supply their data. This keeps the concept of “a line of text with camelCase substrings” separate from “a regular expression that matches camelCase substrings”, so the underlying mechanism for finding the camelCase substrings can be changed more easily if necessary.
from typing import Generator, Iterable, List, NamedTuple


class MatchGroup(NamedTuple):
    start: int
    end: int


class Match(NamedTuple):
    lineno: int
    groups: List[MatchGroup]
    line: str


def find_camel(lines: Iterable[str]) -> Generator[Match, None, None]:
    for i, line in enumerate(lines):
        groups = [MatchGroup(*m.span()) for m in CAMEL_RE.finditer(line)]
        if groups:
            yield Match(i, groups, line)
Example usage:
>>> lines = ["a camelCase line", "thisIs a line with two camelCase words", "a line without camel case"]
>>> for match in find_camel(lines):
... print(match)
Match(lineno=0, groups=[MatchGroup(start=2, end=11)], line='a camelCase line')
Match(lineno=1, groups=[MatchGroup(start=0, end=6), MatchGroup(start=23, end=32)], line='thisIs a line with two camelCase words')
Step 4 (without installing third-party packages) involves using ANSI colour escape codes. To emulate the --color option of grep, each matched line should be output in the following format:
<optional purple filename>:<green line number>: <white text> <red camelCase match> <white text>...
The ‘pretty match’ string is built up by iterating through each MatchGroup of a given Match. For each one, the slice of the original line containing the non-matched text before the MatchGroup is appended unchanged, followed by the slice at the position of the MatchGroup, prefixed with the red escape code: \033[31m. Any text left after the final MatchGroup is appended last.
- Note that after each use of a colour escape code, the colour is reset using \033[m
def pretty_match(m: Match, filename: str = None) -> str:
    """
    Build a 'pretty' string representation of `m`, with coloured text and line
    numbers; optionally prefixed with `filename`.

    Colours:
    - Line numbers = green
    - Matches = red
    - Filenames = purple
    - Non-matching text = white
    """
    pretty_name = f"\033[35m{filename}\033[m:" if filename else ""
    parts = []
    prev = 0
    for g in m.groups:
        # text up until match in white, match in red
        parts.append(f"{m.line[prev:g.start]}\033[31m{m.line[g.start:g.end]}\033[m")
        prev = g.end
    # text after the final match
    parts.append(m.line[prev:])
    return "".join([f"{pretty_name}\033[32m{m.lineno}\033[m:"] + parts)
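The slicing logic is easier to follow with visible markers in place of escape codes. The sketch below is an illustration only: the highlight helper and its start/end parameters are hypothetical names, and plain (start, end) tuples stand in for MatchGroup objects.

```python
from typing import List, Tuple

RED = "\033[31m"   # ANSI escape code for red text
RESET = "\033[m"   # ANSI escape code that resets the colour


def highlight(
    line: str,
    spans: List[Tuple[int, int]],
    start: str = RED,
    end: str = RESET,
) -> str:
    parts = []
    prev = 0
    for s, e in spans:
        parts.append(line[prev:s])                # unmatched text before the span
        parts.append(f"{start}{line[s:e]}{end}")  # the span itself, wrapped
        prev = e
    parts.append(line[prev:])                     # text after the last span
    return "".join(parts)


# Square brackets make the wrapped spans visible without a colour terminal
print(highlight("thisIs a camelCase word", [(0, 6), (9, 18)], "[", "]"))
```

With the default arguments the same call wraps each span in the red and reset codes instead of brackets.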
Bringing it all together in the original main function:
def main(files: List[Path]):
    """
    Print a report on the locations of all camelCase strings in `files`. See
    `pretty_match` for output format.
    """
    show_filenames = len(files) > 1
    for file in files:
        with file.open() as f:
            for m in find_camel(f):
                print(pretty_match(m, filename=file if show_filenames else None))
Exercise 13
Following on from the previous exercise, add the ability to change those variable names automatically in one or more files. Remember to keep a backup of the originals in case something goes horribly, horribly wrong.
Again, see camel_finder.py on GitHub for the full CLI script and tests.
Once a camelCase substring has been found, converting it to snake_case requires two steps:
- Insert an _ character between each camelCase ‘hump’
  camelCase -> camel_Case
  camelCamelCase -> camel_Camel_Case
- Convert to lower case
Regex substitution using a capture group can be used for step 1 by matching and capturing a ‘hump’ character:
camel(C)ase
camel(C)amel(C)ase
An _ character is then inserted before the ‘hump’ character by referencing the capture group in the substitution: '_\1'.
CONVERT_CAMEL_RE = re.compile(
    (
        # match a normal 'hump' i.e. camel(C)ase or camel1(C)ase
        r"((?<=[a-z0-9])[A-Z]"
        # match mid-string uppercase humps, ignoring existing underscores
        # i.e. CAMEL(C)ase or HTTP(E)rror
        r"|(?!^)(?<!_)[A-Z](?=[a-z]))"
    )
)


def convert_camel_word(w: str) -> str:
    return CONVERT_CAMEL_RE.sub(r"_\1", w).lower()
- Demo of this regex: https://regex101.com/r/3AjsDU/1
Example usage:
>>> convert_camel_word("camelCase")
'camel_case'
>>> convert_camel_word("camelCamelCase")
'camel_camel_case'
>>> convert_camel_word("snakey_camelCase")
'snakey_camel_case'
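The second alternative in the pattern earns its keep on acronyms. As a rough comparison, the sketch below sets convert_camel_word next to a deliberately naive version that puts an underscore before every capital letter. naive_convert and the sample words are assumptions made for illustration and are not part of camel_finder.py.

```python
import re

# Same pattern as CONVERT_CAMEL_RE above, joined into one raw string
CONVERT_CAMEL_RE = re.compile(
    r"((?<=[a-z0-9])[A-Z]|(?!^)(?<!_)[A-Z](?=[a-z]))"
)


def convert_camel_word(w: str) -> str:
    return CONVERT_CAMEL_RE.sub(r"_\1", w).lower()


def naive_convert(w: str) -> str:
    # hypothetical simpler approach: underscore before every capital
    return re.sub(r"([A-Z])", r"_\1", w).lower()


for word in ("HTTPError", "camel1Case", "already_Snake"):
    print(f"{word}: {convert_camel_word(word)} vs {naive_convert(word)}")
```

The naive version splits every letter of an acronym and doubles up existing underscores, which is what the lookbehind and lookahead assertions guard against.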
convert_camel_word is designed to convert a single camelCase word, not strings of arbitrary text containing camelCase words. For example:
>>> convert_camel_word("System.out.println(Arrays.toString(myArray));")
'system.out.println(_arrays.to_string(my_array));'
Instead of:
'System.out.println(Arrays.to_string(my_array));'
To convert strings of arbitrary text, the original camelCase matching from exercise 12 is used to first find the locations of individual camelCase strings. Once found, they can be passed to convert_camel_word and inserted into the correct position of the original text.
# factored out of existing find_camel function
def find_match_groups(s: str) -> list:
    return [MatchGroup(*m.span()) for m in CAMEL_RE.finditer(s)]


def convert_camel_line(l: str) -> str:
    # find individual camelCase strings within `l`; work from right to left
    # because each replacement lengthens the line and would otherwise shift
    # the positions of the matches that follow it
    for g in reversed(find_match_groups(l)):
        # replace with snake_case equivalent
        l = l[: g.start] + convert_camel_word(l[g.start : g.end]) + l[g.end :]
    return l


def convert_camel(lines: Iterable[str]) -> Generator[str, None, None]:
    return (convert_camel_line(l) for l in lines)
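Slicing by precomputed positions needs care when a line holds several matches, since every inserted underscore moves the text after it. An alternative sketch that sidesteps the index bookkeeping is to pass a function as the replacement argument of re.sub, which calls it once per match. The name convert_camel_line_sub is hypothetical; the two patterns are the ones from this post, joined into single strings.

```python
import re

CAMEL_RE = re.compile(r"\b[a-z]+((\d)|([A-Z0-9][a-z0-9]+))+([A-Z])?")
CONVERT_CAMEL_RE = re.compile(
    r"((?<=[a-z0-9])[A-Z]|(?!^)(?<!_)[A-Z](?=[a-z]))"
)


def convert_camel_word(w: str) -> str:
    return CONVERT_CAMEL_RE.sub(r"_\1", w).lower()


def convert_camel_line_sub(line: str) -> str:
    # re.sub calls the function with each match object and splices the
    # returned string into the result, so no manual slicing is needed
    return CAMEL_RE.sub(lambda m: convert_camel_word(m.group(0)), line)


print(convert_camel_line_sub("System.out.println(Arrays.toString(myArray));"))
print(convert_camel_line_sub("thisIs a line with two camelCase words"))
```

The trade-off is that this ties the conversion directly to the regular expression, whereas the MatchGroup approach keeps the two concerns apart.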
Applying the conversion to a file involves backing up the original file, converting each line in turn and writing to a new file:
def transform_camel(file: Path):
    """
    Transform all occurrences of camelCase strings in `file` to snake_case. The
    original file is renamed with a ".backup" extension to prevent data loss.
    """
    original = Path(str(file) + ".backup")
    file.rename(original)
    with original.open() as source:
        with file.open("w") as dest:
            dest.writelines(convert_camel(source))
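One thing to watch with the rename approach: on POSIX systems, Path.rename silently replaces an existing target, so running the conversion twice on the same file would overwrite the first backup with already-converted text. Below is a hedged sketch of a stricter backup step that copies instead of renaming and refuses to clobber an existing backup. The make_backup helper is a hypothetical name and is not taken from camel_finder.py.

```python
import shutil
import tempfile
from pathlib import Path


def make_backup(file: Path) -> Path:
    """Copy `file` to `<name>.backup`, refusing to overwrite an existing backup."""
    backup = file.with_name(file.name + ".backup")
    if backup.exists():
        raise FileExistsError(f"{backup} already exists")
    # copy2 copies the contents and attempts to preserve file metadata
    shutil.copy2(file, backup)
    return backup


# Small demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "example.py"
    src.write_text("myVar = 1\n")

    backup = make_backup(src)
    first_ok = backup.read_text() == "myVar = 1\n"

    try:
        make_backup(src)
        second_refused = False
    except FileExistsError:
        second_refused = True

print(first_ok, second_refused)
```

Copying leaves the original in place until the converted text has been written, at the cost of the existence check and copy not being a single atomic step.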
To wrap it up, the existing main entry point delegates according to whether reporting or transformation is required, and an additional --convert command line argument is added. Command line argument parsing is now performed by argparse instead of manually inspecting sys.argv:
import argparse


def main(files: List[Path], convert=False):
    for f in files:
        if convert:
            # transform_camel handles the backup and rewrites the file
            transform_camel(f)
        else:
            # report_camel is not shown in this post
            report_camel(f, show_filenames=len(files) > 1)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=sys.modules[__name__].__doc__)
    parser.add_argument("files", nargs="+", help="source files to scan for camelCase")
    parser.add_argument(
        "--convert",
        action="store_true",
        dest="convert",
        help="perform camelCase to snake_case conversion",
    )
    args = parser.parse_args()
    main([Path(f) for f in args.files], convert=args.convert)
CLI help text:
$ ./camel_finder.py -h
usage: camel_finder.py [-h] [--convert] files [files ...]

Scan source files for camelCase strings, reporting (grep style) on locations
or converting to snake_case. During conversion, original files are renamed with
a ".backup" extension. Yes I'm aware this can probably be achieved with a bash
one-liner.

positional arguments:
  files       source files to scan for camelCase

optional arguments:
  -h, --help  show this help message and exit
  --convert   perform camelCase to snake_case conversion