See the first post in The Pragmatic Programmer 20th Anniversary Edition series for an introduction.
Challenge 1
Design a small address book database (name, phone number, and so on) using a straightforward binary representation in your language of choice. Do this before reading the rest of this challenge.
- Translate that format into a plain-text format using XML or JSON.
- For each version, add a new, variable-length field called directions in which you might enter directions to each person’s house.
What issues come up regarding versioning and extensibility? Which form was easier to modify? What about converting existing data?
Full code can be found on GitHub.
Version 1
Data Model
Each address book record is represented by a Person class containing basic personal information and address fields. Each record also gets a unique id, generated as a UUID. Storing addresses in a universal way is quite complex; however, as this is not a challenge about data modelling, I have assumed a very basic model of a UK address:
# address_book/models.py
from dataclasses import dataclass, field
from uuid import uuid4

@dataclass
class Person:
    first_name: str
    last_name: str
    phone_number: str
    house_number: str
    street: str
    town: str
    postcode: str
    id: str = field(default_factory=lambda: str(uuid4()))
I’m using Python 3.7 dataclasses because Person is mainly (apart from id generation) a Data Transfer Object (DTO). Usage:
>>> Person("Ben", "Steadman", "+1-087-184-1440", "1", "A Road", "My Town", "CB234")
Person(first_name='Ben', last_name='Steadman', phone_number='+1-087-184-1440', house_number='1', street='A Road', town='My Town', postcode='CB234', id='a14fe77b-b5d2-46e7-b42c-9392b4bbec28')
To aid testing, generate_people generates arbitrary Person instances using the excellent Faker library:
# address_book/models.py
from typing import Iterable

from faker import Faker

fake = Faker("en_GB")

def generate_people(n: int) -> Iterable[Person]:
    for _ in range(n):
        yield Person(
            fake.first_name(),
            fake.last_name(),
            fake.phone_number(),
            fake.building_number(),
            fake.street_name(),
            fake.city(),
            fake.postcode(),
        )
Usage:
>>> list(generate_people(2))
[
    Person(
        first_name="Victor",
        last_name="Pearce",
        phone_number="01184960739",
        house_number="2",
        street="Mohamed divide",
        town="Charleneburgh",
        postcode="LS7 0DJ",
        id="cb242277-44dd-4836-98c7-ddbe10183fb4",
    ),
    Person(
        first_name="Stanley",
        last_name="Ashton",
        phone_number="(0131) 496 0908",
        house_number="2",
        street="Karen bridge",
        town="Port Gailland",
        postcode="L3J 2YF",
        id="ef85cfd1-08eb-4629-8747-3d8be1580fc7",
    ),
]
Binary Representation
As this challenge is about data formats and not building a database, I’m interpreting address book database as a file containing a list of address book records - not a DBMS.
To convert between the Person class and a binary representation, the Python struct module can be used.
Performs conversions between Python values and C structs represented as Python bytes objects.
– Python struct documentation
Person can be represented using the following Struct:
import struct
PersonStruct = struct.Struct("50s50s30s10s50s50s10s36s")
Which corresponds to the following C struct:
struct Person {
    char first_name[50];
    char last_name[50];
    char phone_number[30];
    char house_number[10];
    char street[50];
    char town[50];
    char postcode[10];
    char id[36];
};
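As a quick sanity check on that correspondence, here is a small sketch (not part of the original code) showing that the packed record size is simply the sum of the field widths. The field_widths list is just a helper I have introduced for illustration:

```python
import struct

# Same format string as PersonStruct above.
PersonStruct = struct.Struct("50s50s30s10s50s50s10s36s")

# Each "Ns" code maps to a char[N], so no alignment padding is inserted
# and the record size is the sum of the individual widths.
field_widths = [50, 50, 30, 10, 50, 50, 10, 36]
record_size = PersonStruct.size

print(record_size, sum(field_widths))  # 286 286
```

Every record therefore occupies 286 bytes on disk, however short the actual values are.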
Binary packing/unpacking usage:
>>> as_bytes = PersonStruct.pack(b'Ben', b'Steadman', b'+44(0)116 4960124', b'1', b'A Road', b'My Town', b'CB234', b'b36cb798-946e-4dca-b89c-f393616feb7b')
>>> as_bytes
b'Ben\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00Steadman\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00+44(0)116 4960124\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x001\x00\x00\x00\x00\x00\x00\x00\x00\x00A Road\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00My Town\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00CB234\x00\x00\x00\x00\x00b36cb798-946e-4dca-b89c-f393616feb7b'
>>> PersonStruct.unpack(as_bytes)
(b'Ben\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', b'Steadman\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', b'+44(0)116 4960124\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', b'1\x00\x00\x00\x00\x00\x00\x00\x00\x00', b'A Road\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', b'My Town\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', b'CB234\x00\x00\x00\x00\x00', b'b36cb798-946e-4dca-b89c-f393616feb7b')
- Note how the values of the tuple returned from PersonStruct.unpack are padded with \x00 (null bytes) because the struct format specifies a longer string than the original values provided. These need to be removed when unpacking into Person objects.
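To make the padding concrete, here is a minimal sketch using a hypothetical single-field struct (name_struct is my own name, not something from the address book code). It packs a short value, shows the padding that comes back, and strips it again:

```python
import struct

# Hypothetical one-field struct: a single 10-byte string.
name_struct = struct.Struct("10s")

packed = name_struct.pack(b"Ben")
(raw,) = name_struct.unpack(packed)

# raw still carries the padding; strip it before decoding to str.
clean = raw.rstrip(b"\x00").decode("utf-8")

print(raw)    # b'Ben\x00\x00\x00\x00\x00\x00\x00'
print(clean)  # Ben
```

Stripping trailing null bytes is only safe here because the stored values are text that never legitimately ends in \x00.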
To provide a higher level of abstraction over these raw bytes, the conversion functionality can be wrapped up in functions that deal with Person objects:
# address_book/binary.py
import struct
from dataclasses import astuple

from .models import Person

PersonStruct = struct.Struct("50s50s30s10s50s50s10s36s")

def from_bytes(buffer: bytes) -> Person:
    return Person(
        *(
            # remove null bytes added by string packing
            x.decode("utf-8").rstrip("\x00")
            for x in PersonStruct.unpack(buffer)
        )
    )

def to_bytes(p: Person) -> bytes:
    # every field is a str, so each one is encoded before packing
    return PersonStruct.pack(
        *(s.encode("utf-8") for s in astuple(p))
    )
Usage:
>>> me = Person("Ben", "Steadman", "+44(0)116 4960124", "1", "A Road", "My Town", "CB234")
>>> as_bytes = to_bytes(me)
>>> me_again = from_bytes(as_bytes)
>>> me == me_again
True
These Person conversion functions can be used in higher level functions to read and write an entire address book database:
# address_book/binary.py
from functools import partial
from pathlib import Path
from typing import Iterable, List

def read_address_book(db: Path) -> List[Person]:
    people = []
    with db.open("rb") as f:
        # read fixed-size chunks until f.read returns b"" at end of file
        for chunk in iter(partial(f.read, PersonStruct.size), b""):
            people.append(from_bytes(chunk))
    return people

def write_address_book(db: Path, people: Iterable[Person]):
    with db.open("wb") as f:
        f.write(b"".join(to_bytes(p) for p in people))
Usage:
>>> people = list(generate_people(50))
>>> db = Path("data/address-book.bin")
>>> write_address_book(db, people)
>>> people_again = read_address_book(db)
>>> people == people_again
True
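The two-argument form of iter used in read_address_book is perhaps the least obvious part of that code, so here is a small in-memory sketch of the same idiom. RECORD_SIZE and the sample bytes are made up for illustration, and io.BytesIO stands in for the open file:

```python
import io
from functools import partial

RECORD_SIZE = 4  # hypothetical tiny record size, just for the demo
stream = io.BytesIO(b"AAAABBBBCCCC")

# iter(callable, sentinel) keeps calling stream.read(RECORD_SIZE)
# until it returns the sentinel b"", i.e. until end of stream.
chunks = list(iter(partial(stream.read, RECORD_SIZE), b""))

print(chunks)  # [b'AAAA', b'BBBB', b'CCCC']
```

This only works because every record has the same size, which is exactly the assumption the variable length field breaks later on.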
Plain Text Representation
I’ve chosen JSON as the plain text format because the excellent json module in the Python standard library makes it easy to work with. Using the same Person model, the functions from_dict and to_dict are analogous to from_bytes and to_bytes respectively, as the json module converts JSON objects to and from Python dictionaries.
# address_book/plain_text.py
from dataclasses import asdict

from .models import Person

def from_dict(d: dict) -> Person:
    return Person(**d)

def to_dict(p: Person) -> dict:
    return asdict(p)
Usage:
>>> me = Person("Ben", "Steadman", "+44(0)116 4960124", "1", "A Road", "My Town", "CB234")
>>> as_dict = to_dict(me)
>>> me_again = from_dict(as_dict)
>>> me == me_again
True
These can then be used to create JSON versions of read_address_book and write_address_book:
# address_book/plain_text.py
import json
from pathlib import Path
from typing import Iterable, List

def read_address_book(db: Path) -> List[Person]:
    with db.open() as f:
        return [from_dict(d) for d in json.load(f)]

def write_address_book(db: Path, people: Iterable[Person]):
    with db.open("w") as f:
        json.dump([to_dict(p) for p in people], f)
Usage:
>>> people = list(generate_people(50))
>>> db = Path("data/address-book.json")
>>> write_address_book(db, people)
>>> people_again = read_address_book(db)
>>> people == people_again
True
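To show what actually ends up in the file, here is a short sketch that does the same round trip through a string instead of a file. Contact is a hypothetical, trimmed-down stand-in for Person so that the example is self-contained:

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Contact:  # hypothetical stand-in for Person with fewer fields
    first_name: str
    phone_number: str


# Serialise a list of records, as write_address_book does with a file.
text = json.dumps([asdict(Contact("Ben", "+44(0)116 4960124"))], indent=2)
print(text)

# Deserialise again, as read_address_book does.
restored = [Contact(**d) for d in json.loads(text)]
```

Unlike the binary file, the result can be read and edited with any text editor, and each value is labelled with its field name.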
Tests
Each implementation is also covered by a set of simple unit tests, asserting the correctness of the conversions to and from their respective formats:
import pytest

from address_book import binary, plain_text
from address_book.models import Person, generate_people

@pytest.mark.parametrize("p", generate_people(50))
def test_to_bytes_inverts_from_bytes(p):
    p_bytes = binary.to_bytes(p)
    p_again = binary.from_bytes(p_bytes)
    assert p == p_again

@pytest.mark.parametrize("p", generate_people(50))
def test_to_dict_inverts_from_dict(p):
    p_dict = plain_text.to_dict(p)
    p_again = plain_text.from_dict(p_dict)
    assert p == p_again

@pytest.mark.parametrize(
    "module,fname", [(binary, "address-book.bin"), (plain_text, "address-book.json")]
)
def test_write_address_book_inverts_read_address_book(module, fname, tmp_path):
    db = tmp_path / fname
    # sanity check
    assert db.exists() is False
    people = list(generate_people(50))
    module.write_address_book(db, people)
    assert db.exists() is True
    assert db.stat().st_size > 0
    people_again = module.read_address_book(db)
    assert people == people_again
Version 2 (variable length directions)
Adding the additional directions field to the model is simple enough:
from dataclasses import dataclass, field
from typing import Iterable
from uuid import uuid4

from faker import Faker

fake = Faker("en_GB")

@dataclass
class Person:
    first_name: str
    last_name: str
    phone_number: str
    house_number: str
    street: str
    town: str
    postcode: str
    directions: str  # new
    id: str = field(default_factory=lambda: str(uuid4()))

def generate_people(n: int) -> Iterable[Person]:
    for _ in range(n):
        yield Person(
            fake.first_name(),
            fake.last_name(),
            fake.phone_number(),
            fake.building_number(),
            fake.street_name(),
            fake.city(),
            fake.postcode(),
            # new
            fake.text(),  # random latin is about as useful as most directions
        )
Binary Representation
Since the struct module deals with C structs, strings are represented as C char arrays of a fixed length specified in the format string, e.g. struct.pack("11s", b"hello world"). Supporting variable length strings in general is quite an involved process, and if you need to do this for a real application, a third party library such as NetStruct would be recommended. For the purpose of this challenge, however, I won’t be using it, nor will I be implementing a general solution: the code for packing/unpacking records is very tightly coupled to the structure of the records, and I would not recommend following this approach in a real application. However, it does demonstrate the difficulties that can arise when using binary formats.
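Two details of fixed-length packing are easy to trip over, so here is a short sketch of both (the sample strings are my own). First, a value longer than the declared width is silently truncated. Second, the width counts bytes, not characters, which matters as soon as the text is not plain ASCII:

```python
import struct

# A value longer than the declared width is truncated without any error.
fixed = struct.pack("5s", b"hello world")
print(fixed)  # b'hello'

# The width in an "Ns" code is a byte count, so measure the encoded text.
word = "café"
encoded = word.encode("utf-8")
print(len(word), len(encoded))  # 4 5

# Building the format from the encoded length keeps the whole value.
variable = struct.pack("{}s".format(len(encoded)), encoded)
```

So any scheme that sizes a field dynamically has to work from the encoded bytes rather than the original str.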
Since the size of the directions field is variable, the complete format string for packing/unpacking records using struct must be created dynamically:
>>> me = Person(
...     "Ben",
...     "Steadman",
...     "+44(0)116 4960124",
...     "1",
...     "A Road",
...     "My Town",
...     "CB234",
...     "Take a left at the roundabout",
... )
>>> fmt = "50s50s30s10s50s50s10s{}s36s".format(len(me.directions))
>>> struct.pack(fmt, *(s.encode("utf-8") for s in astuple(me)))
b'Ben\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00Steadman\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00+44(0)116 4960124\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x001\x00\x00\x00\x00\x00\x00\x00\x00\x00A Road\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00My Town\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00CB234\x00\x00\x00\x00\x00Take a left at the roundaboutbfe3c3e5-8b65-4e49-8d26-3981257a0dee'
Furthermore, since each packed record will be of a different size, the database file cannot simply be read in equal sized chunks and passed to from_bytes as in the first implementation. To solve this, each record is preceded by its size in bytes. This value can be used to determine the next chunk size to read from the file and pass to from_bytes:
# address_book/binary.py
from typing import Tuple

PERSON_STRUCT_FMT = "50s50s30s10s50s50s10s{}s36s"

def to_bytes(p: Person) -> Tuple[bytes, int]:
    # dynamically add size to format for variable length directions field;
    # use the encoded length, since struct counts bytes rather than characters
    fmt = PERSON_STRUCT_FMT.format(len(p.directions.encode("utf-8")))
    return (
        struct.pack(fmt, *(s.encode("utf-8") for s in astuple(p))),
        struct.calcsize(fmt),
    )

RecordSizeStruct = struct.Struct("I")

def write_address_book(db: Path, people: Iterable[Person]):
    with db.open("wb") as f:
        # prefix each packed record with its size in bytes
        records_with_sizes = (
            RecordSizeStruct.pack(size) + p_bytes
            for p_bytes, size in (to_bytes(p) for p in people)
        )
        f.write(b"".join(records_with_sizes))
from_bytes still receives a buffer of bytes representing an entire packed record; however, to handle the variable length directions field it needs to calculate the position within buffer at which the directions field begins, split the buffer accordingly and unpack each section individually:
# address_book/binary.py
def from_bytes(buffer: bytes) -> Person:
    # calculate sizes of non-variable formats
    before_fmt, after_fmt = PERSON_STRUCT_FMT.split("{}s")
    before_start = struct.calcsize(before_fmt)
    after_start = len(buffer) - struct.calcsize(after_fmt)
    before, direction, after = (
        buffer[:before_start],
        buffer[before_start:after_start],
        buffer[after_start:],
    )
    # dynamically build struct format string for variable length field
    direction_fmt = "{}s".format(len(direction))
    data = (
        struct.unpack(before_fmt, before)
        + struct.unpack(direction_fmt, direction)
        + struct.unpack(after_fmt, after)
    )
    return Person(*(x.decode("utf-8").rstrip("\x00") for x in data))

def read_address_book(db: Path) -> List[Person]:
    people = []
    with db.open("rb") as f:
        while True:
            # each record preceded by its size in bytes, use to determine number
            # of bytes to read from db for the entire record
            size_buf = f.read(RecordSizeStruct.size)
            if not size_buf:
                break  # reached end of db
            record_size = RecordSizeStruct.unpack(size_buf)[0]
            people.append(from_bytes(f.read(record_size)))
    return people
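Stripped of the Person specifics, the size-prefix approach above can be sketched as follows. This is an illustration rather than the code from the repository: write_frames, read_frames and SizePrefix are names I have made up, io.BytesIO stands in for the file, and I have assumed an explicit little-endian prefix ("<I") where the code above uses the native "I":

```python
import io
import struct

# Assumption: explicit little-endian 4-byte unsigned size prefix.
SizePrefix = struct.Struct("<I")


def write_frames(stream, payloads):
    # Write each payload preceded by its length in bytes.
    for payload in payloads:
        stream.write(SizePrefix.pack(len(payload)) + payload)


def read_frames(stream):
    # Read a size prefix, then exactly that many bytes, until end of stream.
    frames = []
    while True:
        header = stream.read(SizePrefix.size)
        if not header:
            break
        (size,) = SizePrefix.unpack(header)
        frames.append(stream.read(size))
    return frames


buf = io.BytesIO()
write_frames(buf, [b"short", b"a much longer record"])
buf.seek(0)
frames = read_frames(buf)
print(frames)  # [b'short', b'a much longer record']
```

Fixing the byte order in the format string is one way to keep the file readable on a machine with a different native byte order.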
A slight adjustment to the tests is needed to account for to_bytes now returning a tuple:
@pytest.mark.parametrize("p", generate_people(50))
def test_to_bytes_inverts_from_bytes(p):
    p_bytes, size = binary.to_bytes(p)
    p_again = binary.from_bytes(p_bytes)
    assert p == p_again
Plain Text Representation
Other than the changes to the Person class, no further changes are required to support the new variable length field.
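One caveat: with directions declared as a required field, loading a JSON record written by the first version would fail because the key is missing. A possible way around that, sketched here as an assumption rather than something the original code does, is to give the field a default value:

```python
from dataclasses import dataclass, field
from typing import Optional
from uuid import uuid4


@dataclass
class Person:
    first_name: str
    last_name: str
    phone_number: str
    house_number: str
    street: str
    town: str
    postcode: str
    # Assumption: directions is treated as optional, hence the default.
    directions: Optional[str] = None
    id: str = field(default_factory=lambda: str(uuid4()))


# A record as the first version would have written it: no "directions" key.
v1_record = {
    "first_name": "Ben",
    "last_name": "Steadman",
    "phone_number": "+44(0)116 4960124",
    "house_number": "1",
    "street": "A Road",
    "town": "My Town",
    "postcode": "CB234",
    "id": "a14fe77b-b5d2-46e7-b42c-9392b4bbec28",
}

old = Person(**v1_record)
print(old.directions)  # None
```

With that default in place, from_dict can read files written by either version without any conversion step.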
Summary
Though I already agreed with the authors’ preference for plain text formats, this challenge certainly demonstrated that in most cases plain text is the appropriate format to use.
The binary representation is more difficult to extend and (at least in this example) required breaking changes to do so. This made any data written using the first version (prior to the introduction of the variable length directions field) incompatible with any data written using the second version. A versioning scheme would need to be devised and represented within the binary format, for example using a pre-defined ‘header’ block of bytes to contain some metadata.
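As a rough idea of what such a header could look like, here is a minimal sketch. Everything in it is hypothetical: the MAGIC value, the header layout and the function names are my own inventions and not part of the code for this challenge:

```python
import struct

MAGIC = b"ADBK"  # hypothetical 4-byte file signature
# Hypothetical layout: 4-byte signature followed by a 2-byte version number.
HeaderStruct = struct.Struct("<4sH")


def pack_header(version: int) -> bytes:
    return HeaderStruct.pack(MAGIC, version)


def read_version(buffer: bytes) -> int:
    # Only the first HeaderStruct.size bytes are inspected.
    magic, version = HeaderStruct.unpack_from(buffer)
    if magic != MAGIC:
        raise ValueError("not an address book file")
    return version


data = pack_header(2) + b"...records would follow here..."
print(read_version(data))  # 2
```

A reader could then check the version first and choose the fixed-size or the size-prefixed record parser accordingly.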
The plain text representation was simple to implement using standard, built-in tools and was simple to extend. If the directions field is deemed optional, any data written in the first version is fully compatible with that of the second version. Converting the data would be a simple text transformation and could in fact be achieved directly in the shell using a tool such as jq. Here’s an example to add the directions field, setting it to a default of null:
$ cat data/address-book.json | jq 'map(. + {"directions": null})'
[
  {
    "first_name": "Fiona",
    "last_name": "Power",
    "phone_number": "01314960440",
    "house_number": "91",
    "street": "Sam fields",
    "town": "North Shanebury",
    "postcode": "M38 1FH",
    "directions": null,
    "id": "264bfab6-f1a5-4adc-a86b-28ae8e41817b"
  },
  {
    "first_name": "Lorraine",
    "last_name": "Richards",
    "phone_number": "+448081570114",
    "house_number": "9",
    "street": "Ashleigh loaf",
    "town": "North William",
    "postcode": "M4H 5PW",
    "directions": null,
    "id": "b0b98056-c8ff-4b4e-a68b-b31e8ae43ac3"
  },
  ...
]
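If jq is not to hand, the same conversion can be sketched in a few lines of Python. The add_directions helper and the sample record are made up for illustration, and the example works on a string rather than a real file:

```python
import json


def add_directions(records):
    # Mirrors jq's `. + {"directions": null}`: set the key on every record.
    return [{**record, "directions": None} for record in records]


v1_text = '[{"first_name": "Fiona", "last_name": "Power"}]'
v2_records = add_directions(json.loads(v1_text))
print(json.dumps(v2_records))
# [{"first_name": "Fiona", "last_name": "Power", "directions": null}]
```

Either way, the conversion is a small, readable transformation of the existing data, which is hard to say of the binary format.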