Reading a lot of code has made me a better engineer. I'm going to try to record here some of the techniques for reading code effectively that I've picked up over the years.
I'm guessing that most who attempt it quickly realize that reading code like reading a novel is a good way to get discouraged. A novel is broken up into chapters which are meant to be read through linearly. Code is generally structured non-linearly, so it requires a different set of approaches.
Sometime in the future, I'll write a post about how to write
readable code, but today I want to focus on reading code. So in the
likely scenario that it's a long time before I get around to writing
that post, rest assured that knowing how to read code will almost certainly make you better at writing code.
These suggestions are colored by my experience with Python and JavaScript, but for the most part I think that these techniques
are fairly language-agnostic.
General tips
- Don't start with the code. A first bit of advice on reading code: for the most part, don't! Start by reading the docs. While fun and interesting, reading code
is often much less efficient than reading documentation. Documentation
will often give you the high-level foundation you need to understand the
code, should you choose to go deeper. For instance, before you dive
into the mirror of the git source code, read the docs on git internals!
- Read code like a computer executes code. Start at entrypoints
(function calls, CLI commands, main routines) and follow references.
Make abundant use of grep, ctrl-F, and GitHub
search. This will help you be a bit more agnostic to the folder and file
structures, which are often rather arbitrary and can be misleading.
I've cumulatively wasted hours clicking or cd- and ls-ing through folder structures looking for some entrypoint or another
only to remember eventually that I could have just searched.
- Follow contributing guides. Many projects have contributing guides
which will show you how to get a local version up and how to run the
tests, which is a great way to discover entrypoints.
- Know the language. Confusion about syntax will get in the
way of clearer understanding. Sometimes the structure of the language
can have an effect on how code is organized (e.g., __init__.py files or *.h files). Remember, too, that you can learn a lot even if you don't understand every bit of syntax you encounter.
- Start high level. Skimming quickly to understand interfaces will help you see
the big picture before you get lost in the details. Be willing to say, "I'll come back to that function - for now I'll just trust that it does what it says it does".
- Think about why it's structured the way it is. Code is written in the way it is for a reason. If you keep this in mind even when you are reading poorly structured code, it will be easier to understand the authors' intent.
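To make the "just search" advice concrete, here is a minimal sketch of a grep-like search for likely entrypoints in a Python code base. The function name find_entrypoints and the two patterns are my own assumptions for illustration; real projects may expose entrypoints in other ways (console scripts, framework hooks), so treat this as a starting point rather than a complete tool.

```python
import re
from pathlib import Path

# Patterns that often mark an entrypoint in Python code.
# Illustrative only; this list is an assumption, not an exhaustive rule.
ENTRYPOINT_PATTERNS = [
    re.compile(r"""if\s+__name__\s*==\s*['"]__main__['"]"""),
    re.compile(r"^\s*def\s+main\s*\("),
]


def find_entrypoints(root):
    """Yield (path, line_number, line) for lines that look like entrypoints."""
    for path in sorted(Path(root).rglob("*.py")):
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable file: skip it rather than abort the search
        for number, line in enumerate(text.splitlines(), start=1):
            if any(pattern.search(line) for pattern in ENTRYPOINT_PATTERNS):
                yield path, number, line.strip()
```

Calling list(find_entrypoints("path/to/repo")) gives a short list of places to start reading, regardless of how the folders are laid out.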
Reviewing your own code
Taking a step back and reviewing your own code is a great way to improve the way you write software.
I took a writing class in college and the main concept they were
trying to drill into us was the value of revisions. We would have to
write and submit the same essay multiple times - first draft, second
draft, final draft - with weeks of revision and discussion in between.
Each time, the essay would improve. All too often, the thesis statement
an author starts with in the introduction is not the thesis statement
the author ends with in the conclusion. When that happens, authors who
know what they are doing rewrite the essay with the new thesis.
This happens in code as well. As a reviewer of your own code, you must be willing and eager
to delete code. Do not be a victim of the sunk-cost fallacy and get too
attached to code you've written.
If this is a struggle, deleting code becomes much easier when a thorough
test suite is in place, which can give you the confidence you need to
refactor regularly. All else being equal, less code is better than more code.
The process of reviewing your own code with a critical eye can help
you find opportunities for abstractions that you missed on the most
recent draft. I often start by looking for repeated patterns or by
talking directly to the users of the abstractions and interfaces I
wrote, about which ones were most useful or confusing.
I find it productive to try to answer some of the following questions:
- What are the main data structures and abstractions I'm using in this code?
- How could data structures or abstractions be changed to enable this
project to be simpler, more precise and easier to contribute to?
- Have I written enough tests to know that I can refactor without fear?
- Would it be easy for others to contribute to or maintain this code?
In other words, could they make changes without fear of breaking things?
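As a small illustration of "refactor without fear", here is a sketch in which two functions that once duplicated the same name-formatting logic now share a helper, and the tests that pinned their behavior before the change still pass afterwards. All of the names and sample data here are hypothetical.

```python
import unittest


def _full_name(person):
    """Shared helper extracted from two formerly duplicated code paths."""
    return f"{person['first']} {person['last']}".strip()


def greeting(person):
    return f"Hello, {_full_name(person)}!"


def mailing_label(person):
    return f"{_full_name(person)}\n{person['address']}"


class PinnedBehaviorTests(unittest.TestCase):
    """Written before the refactor; they must pass unchanged after it."""

    PERSON = {"first": "Ada", "last": "Lovelace", "address": "1 Example Street"}

    def test_greeting(self):
        self.assertEqual(greeting(self.PERSON), "Hello, Ada Lovelace!")

    def test_mailing_label(self):
        self.assertEqual(
            mailing_label(self.PERSON), "Ada Lovelace\n1 Example Street"
        )
```

The tests only talk to the public functions, so the helper can be renamed, inlined, or deleted later without touching them.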
Reviewing contributions from others
Reviewing contributions made by others is an important part of
building a project. This is a big topic, so I'll just mention one key
point here that has made my code review astronomically more effective.
Use a checklist!
Whenever possible, this checklist should be in version control so
that it is known by all contributors. Sometimes it makes sense to put
this into a CONTRIBUTING file, and if using GitHub, it makes sense to put this in a .github/PULL_REQUEST_TEMPLATE.md. The latter is what I use in the eemeter library.
I find it very convenient that GitHub auto-populates the descriptions
of all new pull requests using that template because it helps
contributors be proactive.
Using a checklist should not add a burden to contributors or
reviewers. If it is created with care, it should make it as easy as
possible for contributors to comply with contributing guidelines.
Consider including at least the following in such a checklist for code
review:
- Does the code conform to the style guide? There are many options for
automated code style checking available for many different languages.
Some are configurable, some aren't, some just point out issues, some can
proactively fix them. Providing instructions here about how to run the
automated style checker makes it easy for contributors to write code
that conforms. Consider how this could save you from wasting valuable
review time and effort discussing the minutiae of style-guide
conformance.
- Did the contributor run the existing test suites and add their own?
Similarly, this should proactively give instructions to the contributor
about how to run the existing automated test suite.
- Did the contributor follow the correct branching, committing, and
merging procedures? The checklist should point out where to find
instructions for properly executing these procedures.
- Did the contributor add appropriate documentation and changelog
entries describing their work? Instructions for how to build or
contribute to documentation would be appropriate here.
Using this checklist will free you to focus on the highest-value
review criteria as you look through the code diff or the branches you're
comparing:
- Is the code written in a way that takes advantage of appropriate existing abstractions?
- Does the code introduce any new complexity?
- Are the existing interfaces respected to ensure backwards compatibility, if necessary?
Reading code because the documentation doesn't cut it
Reading code to learn how to use it is often a last resort after
you've read the documentation and come up dry. Maybe you are trying to
extend the library and you want to look at some examples of how to use
the base class that the library implements.
Let me illustrate with an example. I make heavy use of the Django REST Framework Python library for, as you may have guessed, writing REST APIs for Django. It has excellent documentation, and it also has a very readable code base.
The library documentation for "Concrete View Classes"
describes very clearly the methods provided by those classes and the
classes from which they inherit. This documentation is helpful and
describes exactly how to use these classes.
The following classes are the concrete generic views. If you're
using generic views this is normally the level you'll be working at
unless you need heavily customized behavior.
The view classes can be imported from rest_framework.generics.
CreateAPIView
- Used for create-only endpoints.
- Provides a post method handler.
- Extends: GenericAPIView, CreateModelMixin
ListAPIView
- Used for read-only endpoints to represent a collection of model instances.
- Provides a get method handler.
- Extends: GenericAPIView, ListModelMixin
These concrete view classes are available in specific, commonly used
configurations. Looking at the code for these view classes shows exactly the same picture,
but also makes it clear that these classes are tiny and very simple
mappings from HTTP methods to model-related actions. In my opinion, that makes it more
obvious how to use and extend these classes, and how to
learn more about what they do.
# Concrete view classes that provide method handlers
# by composing the mixin classes with the base view.
class CreateAPIView(mixins.CreateModelMixin,
                    GenericAPIView):
    """
    Concrete view for creating a model instance.
    """
    def post(self, request, *args, **kwargs):
        return self.create(request, *args, **kwargs)

class ListAPIView(mixins.ListModelMixin,
                  GenericAPIView):
    """
    Concrete view for listing a queryset.
    """
    def get(self, request, *args, **kwargs):
        return self.list(request, *args, **kwargs)
In this case, the code is almost exactly the same length as the
documentation itself, which makes for a pretty good insight-to-effort
ratio.
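To see why that composition is so easy to extend, here is a dependency-free sketch of the same pattern. None of these classes are Django REST Framework's; GenericView, ListMixin, CreateMixin and the two concrete views are stand-ins I made up so that the example runs without Django. The mixins supply actions, and each concrete view only maps HTTP method names onto them.

```python
class GenericView:
    """Stand-in base view: dispatches an HTTP method name to a handler."""

    def __init__(self, items=None):
        self.items = list(items or [])

    def dispatch(self, method, *args, **kwargs):
        handler = getattr(self, method.lower(), None)
        if handler is None:
            return 405, "Method Not Allowed"
        return handler(*args, **kwargs)


class ListMixin:
    """Provides the list action, but no HTTP method handler."""

    def list(self):
        return 200, self.items.copy()


class CreateMixin:
    """Provides the create action, but no HTTP method handler."""

    def create(self, item):
        self.items.append(item)
        return 201, item


class ListView(ListMixin, GenericView):
    """Read-only: only GET is mapped to an action."""

    def get(self):
        return self.list()


class ListCreateView(ListMixin, CreateMixin, GenericView):
    """Adding an endpoint is one more mixin plus one more tiny handler."""

    def get(self):
        return self.list()

    def post(self, item):
        return self.create(item)
```

For instance, ListView().dispatch("POST", "x") falls through to the 405 response, while the same call on ListCreateView appends the item, because only the latter defines a post handler.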
Reading code to learn how to imitate it
This final category of reading code is, in my opinion, the most
difficult, but also the most interesting. I have learned a ton reading
through the code of libraries or tools that I frequently use, or which
are written by developers whose work I admire.
For example:
- Reading through the requests library taught me that the main API
is organized around a single function (request) with a consistent
interface, for which the get/post/patch/put/etc. methods are really
simple wrappers. In retrospect this makes a lot of sense, as they must
internally share a lot of logic. It's also really interesting that the
get/post/patch/put/etc. methods exist at all - the immediacy,
intuitiveness, and discoverability of the interface matter a lot for
usability.
- Poking around in the pandas library test suite
showed me a bunch of examples of tests, and how useful it can be to
write specialized functions specifically for testing that help to
capture a ton of repeated testing logic.
- While trying to figure out something with timezones a while back, I was led
to the pytz package source code. I mention it here because it is the
most unusual Python package I have ever had the good fortune of looking
through. But it is extraordinarily useful and tons of popular packages
rely on it. It's so ubiquitous that I was initially surprised to learn
that it wasn't a built-in package. Here it is in all its glory, mirrored on GitHub.
- Curiosity about how the CPython sort function worked internally led me to find the source code and also this fascinating text file describing it as a "timsort". That's the same Tim (Peters) who wrote the Zen of Python, which I find quite inspiring.
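The "one core function, thin wrappers" shape from the first bullet above is easy to imitate. The following is a loose sketch of that idea and not the requests source: the core function here returns a plain dict describing the call instead of doing any network I/O, and the URL in the usage note is a placeholder.

```python
def request(method, url, **kwargs):
    """Core function: every HTTP verb goes through here.

    This sketch returns a dict describing the call rather than sending it.
    """
    return {"method": method.upper(), "url": url, **kwargs}


def get(url, params=None, **kwargs):
    """Thin wrapper: fixes the method, forwards everything else."""
    return request("get", url, params=params, **kwargs)


def post(url, data=None, **kwargs):
    """Thin wrapper: fixes the method, forwards everything else."""
    return request("post", url, data=data, **kwargs)
```

A call such as get("https://example.invalid/items", params={"page": 2}) reads naturally at the call site, yet all shared behavior lives in one place, so a change to request applies to every verb at once.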
Poking through libraries (I wish I could remember which ones) when I was just starting out writing Python showed me:
- How to use setup.py for packaging Python projects
- That pytest with tox is a pretty popular test runner configuration
- That the convention for internal functions is to put an underscore before the function name
- That the package six was used a lot for Python 2/3 compatibility
- That Sphinx is a really popular way of building documentation
- That it's pretty common to do a bunch of from module import * in __init__.py files specifically
- That decorators exist and metaclasses exist
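Two of those conventions interact in a way worth seeing once: when a module does not define __all__, a star import skips names that begin with an underscore. Here is a minimal sketch that builds a throwaway module in memory so it needs no files on disk. The module name demo_helpers and both function names are made up for the example.

```python
import sys
import types

# Source for a hypothetical module with one public and one internal function.
source = '''
def public_helper():
    return _internal_helper() + 1

def _internal_helper():
    return 41
'''

# Create the module in memory and register it so that it can be imported.
module = types.ModuleType("demo_helpers")
exec(source, module.__dict__)
sys.modules["demo_helpers"] = module

# Star-import into a fresh namespace, the way an __init__.py might.
namespace = {}
exec("from demo_helpers import *", namespace)

# Collect what the star import exposed, ignoring dunder entries
# such as __builtins__ that exec adds to the namespace.
exported = sorted(name for name in namespace if not name.startswith("__"))
print(exported)
print(namespace["public_helper"]())
```

Only public_helper is exposed by the star import, yet it can still call _internal_helper, because that lookup happens inside the module where both functions were defined.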
Reading code to learn from it is like doing a puzzle. Or like doing
an orienteering course. Or like reading a history book. Or like going
down a rabbit hole. Or like deep sea fishing. You may also be trawling
for unfamiliar concepts, or for useful tools, or for more effective
processes. All of these fish may be found in the vast ocean of open
source repositories waiting to be explored.