September 16, 2023
In the previous entry, we found connections between different Python scripts manually. We iterated through lines of code with a simple check for lines that began with something like `from x import ...` or `import x`. This works fine for getting an idea of module dependencies, but if we begin to ask questions like “what functions are imported” and “what are our dependencies per function” then the complexity begins to increase.
For example, if we want to track which functions are called in a script (how often did I use a deprecated function that needs to be replaced in the next version?) we would have to:

- log imports like `from x import y, z` and note that `y` and `z` are now defined
- check where `y` and `z` are used
- handle module-qualified calls like `x.func1()`
If we also want to track the dependencies of each of our functions independently (rather than for the entire script), then this involves systematically tracking when you are inside a function definition.
At some point in going through code, keeping a careful log of what has been defined, and differentiating between being in a function definition or in a comment, you’re building a parser. This is fun, but it can be difficult to do correctly. Luckily, Python has a built-in parser you can use: `ast`, for Abstract Syntax Trees. This will build a syntax tree of Python code and allow us to crawl it to find the function dependencies we’re looking for.
Goals:

- Get acquainted with the `ast` library
- Use `ast` to extract module and function dependencies from scripts
First, here’s what Python version I’m using (syntax can change from version to version, so this may not work exactly the same):
```
!python3 -V
```

```
Python 3.8.10
```
Here is some example code:
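This is presumably the contents of `example/main.py` (reconstructed from the import and call discussed below):

```python
from worker import do_stuff

do_stuff()
```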
To use `ast`, begin by reading in the content of a file and then use `ast.parse`:
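A minimal sketch of that step (the filename is taken from the file list later in the post):

```python
import ast

with open("example/main.py") as f:
    tree = ast.parse(f.read())
```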
This produces a nested syntax tree; the object returned is the top-level node for the entire module.
Then its `.body` has the contents, an import line and an expression:
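Something like:

```python
tree.body
# [<_ast.ImportFrom at 0x...>, <_ast.Expr at 0x...>]
```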
The nested elements keep going until you hit the lowest levels, the names of variables. You can see the full collection of elements with `ast.walk`, which walks all paths of the syntax tree:
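For example:

```python
nodes = list(ast.walk(tree))
nodes
```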
These are all the nodes:
```
[<_ast.Module at 0x7ffb14597430>,
 <_ast.ImportFrom at 0x7ffb1459ce80>,
 <_ast.Expr at 0x7ffb1454f430>,
 <_ast.alias at 0x7ffb1454f340>,
 <_ast.Call at 0x7ffb1454f550>,
 <_ast.Name at 0x7ffb1454f580>,
 <_ast.Load at 0x7ffb1815c0d0>]
```
Some important ones for our purposes are `Module`, `ImportFrom`, and `Call`.
First is the node for a Python module that we saw at the top level of the object returned from `.parse`:
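```python
isinstance(tree, ast.Module)   # True -- the parsed object is itself the Module node
```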
The `ImportFrom` node is specifically for imports structured like `from x import y`.
We can get the module name and a list of the imported functions, whose names we can access with the `.name` attribute:
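Roughly:

```python
imp = tree.body[0]            # the ImportFrom node
imp.module                    # 'worker'
[a.name for a in imp.names]   # ['do_stuff']
```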
This tells us that we imported `do_stuff` from `worker`, even if it doesn’t specifically tell us the context (where, if anywhere, we used the function).
Function calls are in `Call` nodes.
We can get the arguments of the function, the line number where the call was made, and the name of the function being called:
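A sketch of pulling those out of our example:

```python
call = tree.body[1].value   # the Call node inside the Expr
call.args                   # [] -- do_stuff() takes no arguments
call.lineno                 # the source line of the call
call.func.id                # 'do_stuff'
```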
Use a function to generate a walked list of AST elements given a script name:
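The original helper isn’t shown, but a minimal version consistent with how `walk_script` is used below would be:

```python
def walk_script(filename):
    """Parse a Python file and return all of its AST nodes as a flat list."""
    with open(filename) as f:
        tree = ast.parse(f.read())
    return list(ast.walk(tree))
```

We’ll apply it to `example/worker.py`, which contains: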
```python
from helper_1 import greeting
from helper_2 import name

def do_stuff():
    print(f"{greeting()}, {name()}!")
```
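Walking this file yields all of its nodes:

```python
walked = walk_script("example/worker.py")
walked
```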
```
[<_ast.Module at 0x7ffb144f1c70>,
 <_ast.ImportFrom at 0x7ffb144f1fd0>,
 <_ast.ImportFrom at 0x7ffb144f7310>,
 <_ast.FunctionDef at 0x7ffb144f73a0>,
 <_ast.alias at 0x7ffb144f7520>,
 <_ast.alias at 0x7ffb144f74f0>,
 <_ast.arguments at 0x7ffb144f73d0>,
 <_ast.Expr at 0x7ffb144f7370>,
 <_ast.Call at 0x7ffb144f71c0>,
 <_ast.Name at 0x7ffb144f7160>,
 <_ast.JoinedStr at 0x7ffb144f7400>,
 <_ast.Load at 0x7ffb1815c0d0>,
 <_ast.FormattedValue at 0x7ffb144f7430>,
 <_ast.Constant at 0x7ffb144f7490>,
 <_ast.FormattedValue at 0x7ffb144f7550>,
 <_ast.Constant at 0x7ffb144f75e0>,
 <_ast.Call at 0x7ffb144f72e0>,
 <_ast.Call at 0x7ffb144f7580>,
 <_ast.Name at 0x7ffb144f7460>,
 <_ast.Name at 0x7ffb144f75b0>,
 <_ast.Load at 0x7ffb1815c0d0>,
 <_ast.Load at 0x7ffb1815c0d0>]
```
If we start at the top, the module node tells us this code consists of two import lines and a function definition:
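```python
module_node = walked[0]   # the ast.Module node
module_node.body
```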
```
[<_ast.ImportFrom at 0x7ffb144f1fd0>,
 <_ast.ImportFrom at 0x7ffb144f7310>,
 <_ast.FunctionDef at 0x7ffb144f73a0>]
```
The names of the imported modules:
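```python
[n.module for n in module_node.body if isinstance(n, ast.ImportFrom)]
# ['helper_1', 'helper_2']
```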
We can begin to see how to parse for module and function dependency information: we can use the walked syntax tree and record information about module imports, function calls, and function definitions. Note that if we are crawling the children of a `FunctionDef`, then the function being defined has those children as dependencies.
Part of the power of `ast` is that our dependency parser is more robust to code that is less well-formatted. Take this script (presumably `example/worker_difficult.py`):
```python
from helper_1 import greeting

def do_stuff():
    print(f"{greeting()}, {name()}!")

from helper_2 import name
```
The only thing that changes is the order in the body of the module node: the second import now comes after the function definition (but is not a child of it), and this information is preserved:
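Roughly:

```python
walked_d = walk_script("example/worker_difficult.py")
walked_d[0].body
# [<_ast.ImportFrom ...>, <_ast.FunctionDef ...>, <_ast.ImportFrom ...>]
```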
At the end of this article, there are details of more complex parsing cases, but let’s jump into using the parser. Here is `example/abyss.py`:
```python
import numpy as np

z = np.zeroes(5)

def mul():
    a = np.array([[1, 0],
                  [0, 1]])
    b = np.array([[4, 1],
                  [2, 2]])
    return np.matmul(a, b)

def eigs_of_product():
    a = np.array([[1, 0],
                  [0, 1]])
    b = np.array([[4, 1],
                  [2, 2]])
    product = np.matmul(a, b)
    eigs = np.linalg.eigs(product)
    np.linalg.debug.depth.error_print(eigs)  # this is a fake call, comment out if you want to run this
    return eigs
```
```python
def get_submodule_desc(value):
    # this helps us handle submodules, like the call of `np.linalg.eigs`,
    # which is using the `linalg` submodule of `numpy`
    module_call = []
    # the submodules and functions will have different types
    if isinstance(value, ast.Attribute):
        module_call.append(value.attr)
        module_call.extend(get_submodule_desc(value.value))  # since this was a submodule, keep parsing
    elif isinstance(value, ast.Name):
        module_call.append(value.id)  # finally hit the root module name
    return module_call
```
```python
work_w = walk_script("example/abyss.py")  # 1. create the syntax tree and 2. walk the nodes
calls = [n for n in work_w if isinstance(n, ast.Call)]  # 3. perform a type check (to only get function calls)

for call in calls:
    # 4. extract information from each call node
    function = call.func
    function_name = function.attr
    value = function.value
    submodule_desc = get_submodule_desc(value)
    submodule_desc.reverse()  # we ended up parsing backwards
    print(function_name, ".".join(submodule_desc))  # print the function name and its parent module
```
```
zeroes np
array np
array np
matmul np
array np
array np
matmul np
eigs np.linalg
error_print np.linalg.debug.depth
```
So we have a list of functions called in the script along with their home modules.
Do something like the above for each `.py` file in a directory you want to map out.
We have a simple utility to get all of the Python filenames from a collection of directories, along with separately specified filenames.
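The utility itself isn’t shown; here is a minimal sketch with `pathlib` (the name `get_python_files` is assumed):

```python
from pathlib import Path

def get_python_files(dirs, files=()):
    """Collect every .py file under each directory (recursively),
    plus any individually specified filenames."""
    paths = []
    for d in dirs:
        paths.extend(Path(d).rglob("*.py"))
    paths.extend(Path(f) for f in files)
    return paths

python_files = get_python_files(["example", "example2"], files=["test.py"])
python_files
```

(The exact ordering of the results depends on the filesystem.)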
```
[PosixPath('example/main.py'),
 PosixPath('example/helper_2.py'),
 PosixPath('example/worker_difficult.py'),
 PosixPath('example/worker.py'),
 PosixPath('example/worker_more_difficult.py'),
 PosixPath('example/helper_1.py'),
 PosixPath('example/abyss.py'),
 PosixPath('example/sub/one.py'),
 PosixPath('example/sub/subsub/two.py'),
 PosixPath('example2/lo.py'),
 PosixPath('test.py')]
```
We’ll do a deeper dive in the next article, but we can create a module information dictionary by parsing and extracting information from each script.
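A sketch of that step, where `extract_module_info` is a hypothetical stand-in for the parser we’ll build in the next article:

```python
# hypothetical: extract_module_info(path) returns the per-module dictionary shown below
m_info = {p.stem: extract_module_info(p) for p in python_files}
```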
The module names:
```
dict_keys(['main', 'helper_2', 'worker_difficult', 'worker', 'worker_more_difficult', 'helper_1', 'abyss', 'one', 'two', 'lo', 'test'])
```
For each module, we record the list of imports, function calls, and function definitions. For `worker`, the extracted information looks something like this:
```
{'call_list': [CallNode(module=[], name='print', call_lineno=5, called_by='do_stuff'),
               CallNode(module=[], name='greeting', call_lineno=5, called_by='do_stuff'),
               CallNode(module=[], name='name', call_lineno=5, called_by='do_stuff')],
 'func_defs': [FuncDefNode(name='do_stuff', module='worker', start_lineno=4, end_lineno=5, calls=[CallNode(module=[], name='print', call_lineno=5, called_by='do_stuff'), CallNode(module=[], name='greeting', call_lineno=5, called_by='do_stuff'), CallNode(module=[], name='name', call_lineno=5, called_by='do_stuff')])],
 'import_list': [ImportNode(module='helper_1', function_names=['greeting'], level=0, alias=''),
                 ImportNode(module='helper_2', function_names=['name'], level=0, alias='')]}
```
We have the function calls that are made, tracked by where they are called. We also have a log of the imported modules as well as the functions defined in a module.
The module info dictionary `m_info` contains all the information we need to build up a code graph.
For each function, we can use the call list (here, the one for `abyss`) to create edges from functions to the module-qualified functions they call:
```
[CallNode(module=['np'], name='zeroes', call_lineno=3, called_by=None),
 CallNode(module=['np'], name='array', call_lineno=6, called_by='mul'),
 CallNode(module=['np'], name='array', call_lineno=8, called_by='mul'),
 CallNode(module=['np'], name='matmul', call_lineno=10, called_by='mul'),
 CallNode(module=['np'], name='array', call_lineno=13, called_by='eigs_of_product'),
 CallNode(module=['np'], name='array', call_lineno=15, called_by='eigs_of_product'),
 CallNode(module=['np'], name='matmul', call_lineno=17, called_by='eigs_of_product'),
 CallNode(module=['np', 'linalg'], name='eigs', call_lineno=18, called_by='eigs_of_product'),
 CallNode(module=['np', 'linalg', 'debug', 'depth'], name='error_print', call_lineno=19, called_by='eigs_of_product')]
```
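A minimal way to build the edges from that call list (skipping module-level calls, which have `called_by=None`):

```python
edges = [
    (c.called_by, ".".join(c.module + [c.name]))
    for c in m_info["abyss"]["call_list"]
    if c.called_by is not None   # skip module-level calls like np.zeroes(5)
]
edges
```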
```
[('mul', 'np.array'),
 ('mul', 'np.array'),
 ('mul', 'np.matmul'),
 ('eigs_of_product', 'np.array'),
 ('eigs_of_product', 'np.array'),
 ('eigs_of_product', 'np.matmul'),
 ('eigs_of_product', 'np.linalg.eigs'),
 ('eigs_of_product', 'np.linalg.debug.depth.error_print')]
```
We can see that some functions are called multiple times in the same definition, so we compress this information while keeping track of the number of calls:
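One way to do the compression, with `collections.Counter`:

```python
from collections import Counter

edge_counts = Counter(edges)
# e.g. Counter({('mul', 'np.array'): 2, ('eigs_of_product', 'np.array'): 2,
#               ('mul', 'np.matmul'): 1, ...})
```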
Finally, we create mermaid graph descriptions of the code. (Note: I had to comment these out because my site generator became very confused by these Markdown prints!)
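For reference, a minimal sketch of how such a description could be generated from the edge counts (`to_mermaid` is a hypothetical helper, not necessarily the original implementation):

```python
def to_mermaid(edge_counts):
    """Render weighted call edges as a mermaid flowchart description."""
    lines = ["graph TD"]
    for (caller, callee), n in edge_counts.items():
        callee_id = callee.replace(".", "_")  # mermaid node ids can't contain dots
        lines.append(f'    {caller} -- {n} --> {callee_id}["{callee}"]')
    return "\n".join(lines)

# print(to_mermaid(edge_counts))  # commented out -- see the note above
```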
In the next installment we’ll go into the details about how to actually create this parser with a final repo for creating a function dependency graph.
There are many complex cases that need to be handled to perform a thorough analysis. For instance, when the imports are not all tidily placed at the top, and also:

- several names imported at once: `from x import (a, b, c)`
- star imports: `from module import *`
- aliased imports: `import numpy as np` means we have to track `np` in the code
- inline imports inside function bodies, relative imports, and submodule imports

Here we have examples of all the above (presumably `example/worker_more_difficult.py`):
```python
from helper_1 import greeting
from helper_3 import (a, b, c, d)
from collections import *

def do_stuff():
    import re
    import talk as tk
    print(f"{greeting()}, {name()}!")

from helper_2 import name, crab
from ..test import say_hello
import sub.one
import sub.subsub.two
from sub.one import cat
from sub.subsub.two import dog
```
We catch the imports outside of function definitions as before:
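Roughly (filename assumed):

```python
tree = ast.parse(open("example/worker_more_difficult.py").read())
tree.body
```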
```
[<_ast.ImportFrom at 0x7ffb1452b670>,
 <_ast.ImportFrom at 0x7ffb1452bf40>,
 <_ast.ImportFrom at 0x7ffb1452ba00>,
 <_ast.FunctionDef at 0x7ffb144b5040>,
 <_ast.ImportFrom at 0x7ffb144b5520>,
 <_ast.ImportFrom at 0x7ffb144b55e0>,
 <_ast.Import at 0x7ffb144b5670>,
 <_ast.Import at 0x7ffb144b5730>,
 <_ast.ImportFrom at 0x7ffb144b57f0>,
 <_ast.ImportFrom at 0x7ffb144b5880>]
```
When we have more than one function being imported:
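Something like:

```python
imp = tree.body[1]            # from helper_3 import (a, b, c, d)
[a.name for a in imp.names]   # ['a', 'b', 'c', 'd']
```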
Importing everything from a module is going to be a problem for us to solve. We have no real way of logging the functions we’re bringing in without parsing the entire module:
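The star simply shows up as the imported name:

```python
star = tree.body[2]                        # from collections import *
star.module, [a.name for a in star.names]  # ('collections', ['*'])
```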
We can use the `.level` attribute to determine that something was a relative import (`from ..test import say_hello`). From the docs:

> `level` is an integer holding the level of the relative import (0 means absolute import)
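For example:

```python
rel = tree.body[5]      # from ..test import say_hello
rel.module, rel.level   # ('test', 2)
tree.body[0].level      # 0 -- an absolute import
```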
There is actually a distinct `Import` node type for module imports with the simpler `import x` syntax. The unfortunate part is that this node does not have the `level` attribute, so we will have to manually parse the module names to determine when things were subdirectory imports:
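For instance:

```python
imp = tree.body[6]            # import sub.one
[a.name for a in imp.names]   # ['sub.one'] -- the dotted path is just part of the name
```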
Our function definition includes two inline import statements. As with the module node, we can see these imports in the body of the function definition node:
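```python
func = tree.body[3]   # the do_stuff definition
func.body
```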
```
[<_ast.Import at 0x7ffb144b50d0>,
 <_ast.Import at 0x7ffb144b5160>,
 <_ast.Expr at 0x7ffb144b5280>]
```
Next we need to better understand how to unpack the information in a `Call` node. Here again is `example/abyss.py`:
```python
import numpy as np

z = np.zeroes(5)

def mul():
    a = np.array([[1, 0],
                  [0, 1]])
    b = np.array([[4, 1],
                  [2, 2]])
    return np.matmul(a, b)

def eigs_of_product():
    a = np.array([[1, 0],
                  [0, 1]])
    b = np.array([[4, 1],
                  [2, 2]])
    product = np.matmul(a, b)
    eigs = np.linalg.eigs(product)
    np.linalg.debug.depth.error_print(eigs)  # this is a fake call, comment out if you want to run this
    return eigs
```
We build a counter of node types:
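```python
from collections import Counter

nodes = walk_script("example/abyss.py")
Counter(type(n) for n in nodes)
```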
From this (comparing how many times we used `np` in the script to these counts), we can tell that the information about the numpy calls is found in nodes of type `_ast.Call`:
```
Counter({_ast.Module: 1,
         _ast.Import: 1,
         _ast.Assign: 7,
         _ast.FunctionDef: 2,
         _ast.alias: 1,
         _ast.Name: 23,
         _ast.Call: 9,
         _ast.arguments: 2,
         _ast.Return: 2,
         _ast.Expr: 1,
         _ast.Store: 7,
         _ast.Attribute: 13,
         _ast.Constant: 17,
         _ast.Load: 41,
         _ast.List: 12})
```
So, we grab just the calls like this:
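With a type check:

```python
calls = [n for n in nodes if isinstance(n, ast.Call)]
calls
```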
```
[<_ast.Call at 0x7ffb144f1b20>,
 <_ast.Call at 0x7ffb144f1be0>,
 <_ast.Call at 0x7ffb144f1490>,
 <_ast.Call at 0x7ffb144f10a0>,
 <_ast.Call at 0x7ffb1459cfd0>,
 <_ast.Call at 0x7ffb1454fd30>,
 <_ast.Call at 0x7ffb1454f910>,
 <_ast.Call at 0x7ffb1458a760>,
 <_ast.Call at 0x7ffb1458a580>]
```
This is the basic structure of the call data:
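A sketch, using the first call (`np.zeroes(5)`):

```python
call = calls[0]      # np.zeroes(5)
call.lineno          # 3
call.args            # [<_ast.Constant ...>] -- the argument 5
call.func.attr       # 'zeroes'
call.func.value.id   # 'np'
```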
Nested calls (using a submodule, like when we use `np.linalg.eigs`) require different handling, as do deeply nested calls (`np.linalg.debug.depth.error_print(eigs)`): you just keep grabbing `.value` until the type changes from `ast.Attribute` to `ast.Name`:
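This is the same walk the recursive `get_submodule_desc` above performs, written as a loop:

```python
call = calls[-1]   # np.linalg.debug.depth.error_print(eigs)
node = call.func
parts = []
while isinstance(node, ast.Attribute):
    parts.append(node.attr)   # 'error_print', 'depth', 'debug', 'linalg'
    node = node.value
parts.append(node.id)         # the ast.Name at the bottom: 'np'
parts[::-1]                   # ['np', 'linalg', 'debug', 'depth', 'error_print']
```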
In some of these examples we can begin to see an issue we’ll need to resolve later. Really, rather than using the walked list of nodes in the AST, we want to recursively walk from the top node and use recursive calls to extract data from the nodes. More on that in the next article.