Why is Datalog Good for Searching Code?
During the Summer of 2019 I got really interested in Datalog. It’s a language that is a syntactic subset of Prolog, with a few constraints that prevent the language from being Turing-complete. As a result, you can’t use it for general-purpose programming; it is really only meant for querying a database of messy, inter-connected facts. Because it’s not Turing-complete, every Datalog program is guaranteed to finish, and a query optimizer can do things behind the scenes to make queries run fast.
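To give a feel for the language: a Datalog program is nothing but facts and rules over those facts. Here’s a minimal sketch (the predicate names are my own, not from any particular engine):

% facts: a tiny call graph
calls('main', 'parseArgs').
calls('main', 'run').
calls('run', 'handleRequest').

% rules: reachability through the call graph
reachable(F, G) :- calls(F, G).
reachable(F, G) :- calls(F, H), reachable(H, G).

Asking reachable('main', X)? gives back every function transitively called from main, and the engine figures out how to evaluate the recursion; there’s no loop or visited-set to write.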
The original problem that got me interested in Datalog was this project to simplify the underlying code that powers Rust’s borrow checker. By describing the meat of the logic in Datalog, the actual code becomes much more high-level and aligns with the intent of the rules of borrow-checking, while at the same time becoming less prone to the kinds of bugs you can accidentally write when traversing complicated data structures like the internals that represent source code.
A year went by, and despite learning broadly about a lot of different logic languages, I hadn’t managed to find anything in my own work that required that degree of high-level rule management with low-level concrete application.
Last month, Pete Vilter wrote about using Datalog as the interface to searching through your code. He demonstrated how concise it can be to implement type inference over a language in Datalog once your parser can produce facts for Datalog’s EDB (the extensional database, the set of base facts the engine queries over).
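I haven’t reproduced his exact predicates here, but the flavor is something like this hypothetical two-rule sketch, where intLiteral and assignedFrom are facts a parser would emit:

% literals have a known type
typeOf(Expr, 'int') :- intLiteral(Expr).
% a variable takes the type of whatever expression is assigned to it
typeOf(Var, T) :- assignedFrom(Var, Expr), typeOf(Expr, T).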
While I hadn’t found any immediate practical reason to use Datalog or another logic language in any of my projects directly, the idea of Datalog as your method of searching through code is something I could use daily!
Today, most people use basic text search tools to find things in a codebase, whether it’s to find where something was defined, or to see all the places where a function gets called. For small questions about a codebase this is fine, but it has its limits. For example, it only works if your code is written consistently enough for a text match to find things. Additionally, it can only answer the first step of a question that has multiple steps, like “Where are all the places this function is called, and how many of those calls pass a string that wasn’t run through the string-sanitizing function?”
How Code Search Works Today
Let’s say you need to track down all of the possible routes in a service. In our service, all requests get sent through one large function body with a long chain of branches that route each request on to the various handlers.
if (request.methodLine == "AddTransaction") {
addTxn(request);
} else if (request.methodLine == "AuditTransaction") {
auditTxn(request);
} else if (request.methodLine == "AppendTransaction") {
appendTxn(request);
} else if (request.methodLine == 'SuspendTransaction') {
suspendTxn(request);
} else if (request.methodLine == "RestoreTransaction") {
restoreTxn(request);
} else if (request.methodLine == 'DeductTransaction') {
deductTxn(request);
} else if (request.methodLine == 'FlatterTransaction') {
flatterTxn(request);
} // ...
You can get a messy list of routes that are supported by the server with:
grep -R "methodLine\s*==\s*\"\w\+\""
This says “find every line where methodLine is compared to a double-quoted string”. This is pretty good! But it actually misses at least half of our routes. Why? Some of the routes use single quotes instead of double quotes. We could change our grep to deal with that:
grep -R "methodLine\s*==\s*['\"]\w\+['\"]"
Or we could clean up our code and demand everyone abide by The Standards when making code changes. It’s tempting to demand discipline and rigorous conformity to arbitrary rules when there’s a systemic problem permeating a community (like using text to represent literally everything in a computer), but once you’ve cleaned up all the code you find a new problem:
// ...
} else if (request.methodLine == 'DeductTransaction') {
deductTxn(request);
} else if (request.methodLine == 'FlatterTransaction') {
flatterTxn(request);
} else {
// use new macro + class-based request routing
/*
ROUTE(Foo) expands to:
if (request.methodLine == "Foo") {
new Foo(request);
}
*/
ROUTE(InsultTransaction);
ROUTE(ThreatenTransaction);
}
Someone wrote a macro to automate boilerplate for new command handlers. The code you are grepping for no longer even exists after a certain point. You could graft a new regex onto the old one, but it still isn’t great.
But we’re stuck with grep because text is all that’s accessible to us. While it’s possible to hook into your compiler’s API, nobody has the time to learn it just to run the compiler up to the AST stage, keep the results, and then craft some graph traversal that looks for the abstract pattern representing the routing branches above. Instead we’d probably stitch together a couple of regexes, gather results in some scratch files, then use a text editor to clean things up and stitch them all together into a comprehensive list of every route.
Datalog for a Smarter Search
Switching to a Datalog-based workflow for querying code is exciting because it opens up the possibility to write that same search in a way that looks something like this:
routeEndpoint(RouteName) :-
equalityCheck(F, SLiteralNode),
getField(R, 'methodLine', F),
requestClass(R),
stringLiteral(SLiteralNode, RouteName).
You’d define that rule, then ask the engine for routeEndpoint(X)?, and it’d traverse the codebase looking for every instance where there’s an equality check between two things:
... :-
equalityCheck(F, SLiteralNode), ...
Those two things are given names in the rule, F and SLiteralNode. These names don’t have a semantic value associated with them yet, but we establish what they mean next. The F variable is the result of getField, which would come from the compiler as it builds up the AST. Put together, the two predicates
...
getField(R, 'methodLine', F),
requestClass(R),
...
represent the .methodLine part of the check request.methodLine == "AddTransaction".
Next we define what SLiteralNode means in the equalityCheck(F, SLiteralNode) predicate with
...
stringLiteral(SLiteralNode, RouteName).
Our parser gives us something called stringLiteral(Node, ValueInsideStringLiteral), which represents the String object "AddTransaction" in the code request.methodLine == "AddTransaction" and gives us a way to pluck the value of that string literal out.
So now that all of the terms we are searching for are defined, the Datalog engine can look for matches. Instead of scanning through messy text, we are searching for specific patterns in the codebase’s abstract syntax tree. We avoid whitespace differences and single/double-quote issues because we’re now dealing with the high-level data structure that represents the actual code itself, instead of the text of the file. Additionally, our macro problem goes away: the macro expands into the same syntax-tree format as our hand-written code, so the search finds instances that were hard to find in text because of our aversion to boilerplate.
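To make that concrete, here’s roughly what the facts for the single branch request.methodLine == "AddTransaction" might look like, with made-up node ids standing in for AST nodes:

requestClass(node1).                     % node1 is the `request` object
getField(node1, 'methodLine', node2).    % node2 is `request.methodLine`
stringLiteral(node3, 'AddTransaction').  % node3 is the string literal
equalityCheck(node2, node3).             % the `==` comparison

Against these four facts, routeEndpoint(X)? binds X to 'AddTransaction'.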
Semantic Search Adapts to Change
This is already a huge improvement over raw text searching, but Datalog is also better at adapting to code refactors.
Let’s say someone starts to refactor our code to use constants for those routes. If they kept the constants in a macro, it’d expand to the same string literals and, depending on our underlying foundation of Datalog facts, nothing would need to change. However, if they started using String objects instead of string literals, we could add a rule to catch that.
const string ADDTRANSACTION = "AddTransaction";
const string AUDITTRANSACTION = "AuditTransaction";
// TODO: move more routes to constants
if (request.methodLine == ADDTRANSACTION) {
addTxn(request);
} else if (request.methodLine == AUDITTRANSACTION) {
auditTxn(request);
} else if (request.methodLine == "AppendTransaction") {
appendTxn(request);
} else if (request.methodLine == 'SuspendTransaction') {
suspendTxn(request);
} else if (request.methodLine == "RestoreTransaction") {
...
We can change our predicate for matching string literals to a rule that handles both cases:
% two rules with the same head let us match in more than one case
stringValue(S, V) :- getField(S, 'value', V), stringClass(S).
stringValue(S, V) :- stringLiteral(S, V).
% Our original search rule, with stringLiteral replaced by stringValue
routeEndpoint(RouteName) :-
equalityCheck(F, S),
getField(R, 'methodLine', F),
requestClass(R),
stringValue(S, RouteName).
Adding two definitions for stringValue(S, V) lets us express “the string value inside” for both string objects and string literals.
We update routeEndpoint(RouteName) to use stringValue, and now this search will continue to work during the partial refactor.
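Concretely, the two kinds of branches now produce two different shapes of facts, and each shape satisfies one of the stringValue rules (node ids again made up):

% the constant-based branch: a String object holding a value
stringClass(node10).
getField(node10, 'value', 'AddTransaction').

% an untouched branch: still a plain string literal
stringLiteral(node11, 'AppendTransaction').

The first rule derives stringValue(node10, 'AddTransaction'), the second derives stringValue(node11, 'AppendTransaction'), and, combined with the comparison facts for each branch, routeEndpoint finds both routes.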
Text was a Bad Choice
The reason I’m excited about this idea is that while grep gets you a long way, it requires your code to be extremely consistent (and to never change) before it can answer certain questions about your codebase, and it can only give you answers as a list of lines of text.
This is good enough if you can stop there, but if you need to take those lines of text and do a second step of searching, you now have to parse those results in a new script that goes off and performs more searches for every line.
Answering Harder Questions with Datalog
In my case, I want to find routes without test coverage. Remaining in this text-in-text-out pipeline seems acceptable if it’s the only way you’ve ever dealt with code, but it trades consistency in one part of the problem (“This is easy because everything is text”) for an enormous amount of complexity literally everywhere else (“I have text inside my text that represents another layer of text so after I escape the escaped text I need to unescape it to feed it into…”).
For my goal of finding test coverage, staying in Datalog is actually better because my routeEndpoint(RouteName) rule cleanly chains into other rules I can write! I can write a rule that finds a test that contains code hitting the server with a particular route:
mentionsRoute(Test, RouteName) :-
pythonTest(Test),
pythonFunction(Test, FunctionObject),
lineInFunction(FunctionObject, Line),
expressionsInLine(Line, Expression),
methodCallExpression(Expression, M),
methodCall(M, AnyObj, 'sendRequest', RouteName).
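As a hypothetical example of the facts this rule would match, a test containing server.sendRequest('AddTransaction') might be extracted as:

pythonTest(testAdd).
pythonFunction(testAdd, fn1).
lineInFunction(fn1, line1).
expressionsInLine(line1, expr1).
methodCallExpression(expr1, call1).
methodCall(call1, serverObj, 'sendRequest', 'AddTransaction').

from which the engine derives mentionsRoute(testAdd, 'AddTransaction').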
These predicate facts are kind of hand-wavy, but you can imagine a parser that reads in the text of a function and produces facts for every Python function, describing every line and every expression in each line, reproducing the AST. Once we have a rule that takes every test and tells us the routes found in it, we can find every route that needs a test.
routesThatNeedTests(RouteName) :-
routeEndpoint(RouteName),
not mentionsRoute(_, RouteName).
The above rule finds every route, finds every route that’s mentioned in a test, and then tells you which routes don’t appear in any test.
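With that rule loaded, you’d ask

routesThatNeedTests(RouteName)?

and get back one binding per untested route, ready to chain into yet more rules.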
Speed Up Continuous Integration with Datalog
You can also build queries that help you in other ways:
% imagine we've written something to parse the diff for files
% and line numbers
linesChangedBetweenGitRevisions(OldCommit, NewCommit, Line) :- ...
% assuming AST nodes link back to their files and line numbers,
% this rule is a typical vertex reachability query in Datalog
routeTouchesLine(Route, Line) :- ...
routesAffectedByDiff(OldCommit, NewCommit, Route) :-
routeTouchesLine(Route, Line),
linesChangedBetweenGitRevisions(OldCommit, NewCommit, Line).
% this is the rule you can use to speed up CI, only run tests that need to be run!
testsAffectedByDiff(OldCommit, NewCommit, Test) :-
routesAffectedByDiff(OldCommit, NewCommit, Route),
mentionsRoute(Test, Route).
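The routeTouchesLine rule is elided above, but one way it could look, assuming hypothetical routeNode, astChild, and nodeAtLine facts from the extractor, is the same reachability pattern as before:

% routeNode(Route, Node): Node is the AST root of Route's handler
% astChild(Parent, Child): Child is a direct AST child of Parent
% nodeAtLine(Node, Line): the source line a node came from
inRoute(Route, Node) :- routeNode(Route, Node).
inRoute(Route, Child) :- inRoute(Route, Node), astChild(Node, Child).
routeTouchesLine(Route, Line) :- inRoute(Route, Node), nodeAtLine(Node, Line).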
So when a PR comes in, or when commits are pushed onto master, your CI tool should know how to get the old and new git commit ids, and it can get a list of tests to run with
testsAffectedByDiff('abc4f3', 'ccd333', Test)?
As long as it gives you your list of tests in the right format, you can pass it to your test runner, like
datalog_search "testsAffectedByDiff('abc4f3', 'ccd333', Test)?" \
| xargs pytest -v