September 16, 2023
In the previous entry, we found connections between different Python scripts manually. We iterated through lines of code with a simple check for lines that began with something like `from x import ...` or `import x`. This works fine for getting an idea of module dependencies, but if we begin to ask questions like “what functions are imported” and “what are our dependencies per function” then the complexity begins to increase.
For example, if we want to track which functions are called in a script (how often did I use a deprecated function that needs to be replaced in the next version?) we would have to:

- log imports like `from x import y, z` and note that `y` and `z` are now defined
- check where `y` and `z` are used
- handle module-qualified calls like `x.func1()`
If we also want to track the dependencies of each of our functions independently (rather than for the entire script), then this involves systematically tracking when you are inside a function definition.
At some point in going through code, keeping a careful log of what has been defined, and differentiating between being in a function definition or in a comment, you’re building a parser. This is fun, but it can be difficult to do correctly. Luckily, Python has a built-in parser you can use: `ast`, for Abstract Syntax Trees. This will build a syntax tree of Python code and allow us to crawl it to find the function dependencies we’re looking for.
Goals:

- Get acquainted with the `ast` library
- Use `ast` to extract module and function dependencies from scripts
First, here’s what Python version I’m using (syntax can change from version to version, so this may not work exactly the same):
```
!python3 -V
```

```
Python 3.8.10
```
Here is some example code:
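This is presumably the contents of `example/main.py` (reconstructed from the import and call discussed below):

```python
from worker import do_stuff

do_stuff()
```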
To use `ast`, begin by reading in the content of a file and then use `ast.parse`:
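A minimal sketch of that step (the filename is taken from the file list later in the post):

```python
import ast

with open("example/main.py") as f:
    tree = ast.parse(f.read())
```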
This produces a nested syntax tree; the object returned is the top-level node for the entire module.
Then its `.body` has the contents, an import line and an expression:
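Something like:

```python
tree.body
# [<_ast.ImportFrom at 0x...>, <_ast.Expr at 0x...>]
```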
The nested elements keep going until you hit the lowest levels, the names of variables. You can see the full collection of elements with `ast.walk`, which walks all paths of the syntax tree:
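For example:

```python
nodes = list(ast.walk(tree))
nodes
```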
These are all the nodes:
```
[<_ast.Module at 0x7ffb14597430>,
 <_ast.ImportFrom at 0x7ffb1459ce80>,
 <_ast.Expr at 0x7ffb1454f430>,
 <_ast.alias at 0x7ffb1454f340>,
 <_ast.Call at 0x7ffb1454f550>,
 <_ast.Name at 0x7ffb1454f580>,
 <_ast.Load at 0x7ffb1815c0d0>]
```
Some important ones for our purposes are `Module`, `ImportFrom`, and `Call`.
First is the node for a Python module that we saw at the top level of the object returned from `.parse`:
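```python
isinstance(tree, ast.Module)   # True -- the parsed object is itself the Module node
```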
The `ImportFrom` node is specifically for imports structured like `from x import y`.
We can get the module name and a list of the imported functions, whose names we can access with the `.name` attribute:
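Roughly:

```python
imp = tree.body[0]            # the ImportFrom node
imp.module                    # 'worker'
[a.name for a in imp.names]   # ['do_stuff']
```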
This tells us that we imported `do_stuff` from `worker`, even if it doesn’t specifically tell us the context (where, if anywhere, we used the function).
Function calls are in `Call` nodes.
We can get the arguments of the function, the line number where the call was made, and the name of the function being called:
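A sketch of pulling those out of our example:

```python
call = tree.body[1].value   # the Call node inside the Expr
call.args                   # [] -- do_stuff() takes no arguments
call.lineno                 # the source line of the call
call.func.id                # 'do_stuff'
```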
Use a function to generate a walked list of AST elements given a script name:
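The original helper isn’t shown, but a minimal version consistent with how `walk_script` is used below would be:

```python
def walk_script(filename):
    """Parse a Python file and return all of its AST nodes as a flat list."""
    with open(filename) as f:
        tree = ast.parse(f.read())
    return list(ast.walk(tree))
```

We’ll apply it to `example/worker.py`, which contains: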
```python
from helper_1 import greeting
from helper_2 import name

def do_stuff():
    print(f"{greeting()}, {name()}!")
```
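Walking this file yields all of its nodes:

```python
walked = walk_script("example/worker.py")
walked
```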
```
[<_ast.Module at 0x7ffb144f1c70>,
 <_ast.ImportFrom at 0x7ffb144f1fd0>,
 <_ast.ImportFrom at 0x7ffb144f7310>,
 <_ast.FunctionDef at 0x7ffb144f73a0>,
 <_ast.alias at 0x7ffb144f7520>,
 <_ast.alias at 0x7ffb144f74f0>,
 <_ast.arguments at 0x7ffb144f73d0>,
 <_ast.Expr at 0x7ffb144f7370>,
 <_ast.Call at 0x7ffb144f71c0>,
 <_ast.Name at 0x7ffb144f7160>,
 <_ast.JoinedStr at 0x7ffb144f7400>,
 <_ast.Load at 0x7ffb1815c0d0>,
 <_ast.FormattedValue at 0x7ffb144f7430>,
 <_ast.Constant at 0x7ffb144f7490>,
 <_ast.FormattedValue at 0x7ffb144f7550>,
 <_ast.Constant at 0x7ffb144f75e0>,
 <_ast.Call at 0x7ffb144f72e0>,
 <_ast.Call at 0x7ffb144f7580>,
 <_ast.Name at 0x7ffb144f7460>,
 <_ast.Name at 0x7ffb144f75b0>,
 <_ast.Load at 0x7ffb1815c0d0>,
 <_ast.Load at 0x7ffb1815c0d0>]
```
If we start at the top, the module node tells us this code consists of two import lines and a function definition:
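```python
module_node = walked[0]   # the ast.Module node
module_node.body
```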
```
[<_ast.ImportFrom at 0x7ffb144f1fd0>,
 <_ast.ImportFrom at 0x7ffb144f7310>,
 <_ast.FunctionDef at 0x7ffb144f73a0>]
```
The names of the imported modules:
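```python
[n.module for n in module_node.body if isinstance(n, ast.ImportFrom)]
# ['helper_1', 'helper_2']
```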
We can begin to see how to parse for module and function dependency information: we can use the walked syntax tree and record information about module imports, function calls, and function definitions. Note that if we are crawling the children of a `FunctionDef`, then the function being defined has those children as dependencies.
Part of the power of `ast` is that our dependency parser is more robust to code that is less well-formatted. Take this script (presumably `example/worker_difficult.py`):
```python
from helper_1 import greeting

def do_stuff():
    print(f"{greeting()}, {name()}!")

from helper_2 import name
```
The only thing that changes is the order in the body of the module node: the second import now comes after the function definition (but is not a child of it), and this information is preserved:
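Roughly:

```python
walked_d = walk_script("example/worker_difficult.py")
walked_d[0].body
# [<_ast.ImportFrom ...>, <_ast.FunctionDef ...>, <_ast.ImportFrom ...>]
```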
At the end of this article, there are details of more complex parsing cases, but let’s jump into using the parser. Here is `example/abyss.py`:
```python
import numpy as np

z = np.zeroes(5)

def mul():
    a = np.array([[1, 0],
                  [0, 1]])
    b = np.array([[4, 1],
                  [2, 2]])
    return np.matmul(a, b)

def eigs_of_product():
    a = np.array([[1, 0],
                  [0, 1]])
    b = np.array([[4, 1],
                  [2, 2]])
    product = np.matmul(a, b)
    eigs = np.linalg.eigs(product)
    np.linalg.debug.depth.error_print(eigs)  # this is a fake call, comment out if you want to run this
    return eigs
```
```python
def get_submodule_desc(value):
    # this helps us handle submodules, like the call of `np.linalg.eigs`,
    # which is using the `linalg` submodule of `numpy`
    module_call = []
    # the submodules and functions will have different types
    if isinstance(value, ast.Attribute):
        module_call.append(value.attr)
        module_call.extend(get_submodule_desc(value.value))  # since this was a submodule, keep parsing
    elif isinstance(value, ast.Name):
        module_call.append(value.id)  # finally hit the root module name
    return module_call
```
```python
work_w = walk_script("example/abyss.py")  # 1. create the syntax tree and 2. walk the nodes
calls = [n for n in work_w if isinstance(n, ast.Call)]  # 3. perform a type check (to only get function calls)

for call in calls:
    # 4. extract information from each call node
    function = call.func
    function_name = function.attr
    value = function.value
    submodule_desc = get_submodule_desc(value)
    submodule_desc.reverse()  # we ended up parsing backwards
    print(function_name, ".".join(submodule_desc))  # print the function name and its parent module
```
```
zeroes np
array np
array np
matmul np
array np
array np
matmul np
eigs np.linalg
error_print np.linalg.debug.depth
```
So we have a list of functions called in the script along with their home modules.
Do something like the above for each `.py` file in a directory you want to map out.
We have a simple utility to get all of the Python filenames from a collection of directories, along with separately specified filenames.
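The utility itself isn’t shown; here is a minimal sketch with `pathlib` (the name `get_python_files` is assumed):

```python
from pathlib import Path

def get_python_files(dirs, files=()):
    """Collect every .py file under each directory (recursively),
    plus any individually specified filenames."""
    paths = []
    for d in dirs:
        paths.extend(Path(d).rglob("*.py"))
    paths.extend(Path(f) for f in files)
    return paths

python_files = get_python_files(["example", "example2"], files=["test.py"])
python_files
```

(The exact ordering of the results depends on the filesystem.)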
```
[PosixPath('example/main.py'),
 PosixPath('example/helper_2.py'),
 PosixPath('example/worker_difficult.py'),
 PosixPath('example/worker.py'),
 PosixPath('example/worker_more_difficult.py'),
 PosixPath('example/helper_1.py'),
 PosixPath('example/abyss.py'),
 PosixPath('example/sub/one.py'),
 PosixPath('example/sub/subsub/two.py'),
 PosixPath('example2/lo.py'),
 PosixPath('test.py')]
```
We’ll do a deeper dive in the next article, but we can create a module information dictionary by parsing and extracting information from each script.
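A sketch of that step, where `extract_module_info` is a hypothetical stand-in for the parser we’ll build in the next article:

```python
# hypothetical: extract_module_info(path) returns the per-module dictionary shown below
m_info = {p.stem: extract_module_info(p) for p in python_files}
```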
The module names:
```
dict_keys(['main', 'helper_2', 'worker_difficult', 'worker', 'worker_more_difficult', 'helper_1', 'abyss', 'one', 'two', 'lo', 'test'])
```
For each module, we record the list of imports, function calls, and function definitions. For `worker`, the extracted information looks something like this:
```
{'call_list': [CallNode(module=[], name='print', call_lineno=5, called_by='do_stuff'),
               CallNode(module=[], name='greeting', call_lineno=5, called_by='do_stuff'),
               CallNode(module=[], name='name', call_lineno=5, called_by='do_stuff')],
 'func_defs': [FuncDefNode(name='do_stuff', module='worker', start_lineno=4, end_lineno=5, calls=[CallNode(module=[], name='print', call_lineno=5, called_by='do_stuff'), CallNode(module=[], name='greeting', call_lineno=5, called_by='do_stuff'), CallNode(module=[], name='name', call_lineno=5, called_by='do_stuff')])],
 'import_list': [ImportNode(module='helper_1', function_names=['greeting'], level=0, alias=''),
                 ImportNode(module='helper_2', function_names=['name'], level=0, alias='')]}
```
We have the function calls that are made, tracked by where they are called. We also have a log of the imported modules as well as the functions defined in a module.
The module info dictionary `m_info` contains all the information we need to build up a code graph.
For each function, we can use the call list (here, the one for `abyss`) to create edges from functions to the module-qualified functions they call:
```
[CallNode(module=['np'], name='zeroes', call_lineno=3, called_by=None),
 CallNode(module=['np'], name='array', call_lineno=6, called_by='mul'),
 CallNode(module=['np'], name='array', call_lineno=8, called_by='mul'),
 CallNode(module=['np'], name='matmul', call_lineno=10, called_by='mul'),
 CallNode(module=['np'], name='array', call_lineno=13, called_by='eigs_of_product'),
 CallNode(module=['np'], name='array', call_lineno=15, called_by='eigs_of_product'),
 CallNode(module=['np'], name='matmul', call_lineno=17, called_by='eigs_of_product'),
 CallNode(module=['np', 'linalg'], name='eigs', call_lineno=18, called_by='eigs_of_product'),
 CallNode(module=['np', 'linalg', 'debug', 'depth'], name='error_print', call_lineno=19, called_by='eigs_of_product')]
```
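A minimal way to build the edges from that call list (skipping module-level calls, which have `called_by=None`):

```python
edges = [
    (c.called_by, ".".join(c.module + [c.name]))
    for c in m_info["abyss"]["call_list"]
    if c.called_by is not None   # skip module-level calls like np.zeroes(5)
]
edges
```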
```
[('mul', 'np.array'),
 ('mul', 'np.array'),
 ('mul', 'np.matmul'),
 ('eigs_of_product', 'np.array'),
 ('eigs_of_product', 'np.array'),
 ('eigs_of_product', 'np.matmul'),
 ('eigs_of_product', 'np.linalg.eigs'),
 ('eigs_of_product', 'np.linalg.debug.depth.error_print')]
```
We can see that some functions are called multiple times in the same definition, so we compress this information while keeping track of the number of calls:
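One way to do the compression, with `collections.Counter`:

```python
from collections import Counter

edge_counts = Counter(edges)
# e.g. Counter({('mul', 'np.array'): 2, ('eigs_of_product', 'np.array'): 2,
#               ('mul', 'np.matmul'): 1, ...})
```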
Finally, we create mermaid graph descriptions of the code. (Note: I had to comment these out because my site generator became very confused by these Markdown prints!)
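For reference, a minimal sketch of how such a description could be generated from the edge counts (`to_mermaid` is a hypothetical helper, not necessarily the original implementation):

```python
def to_mermaid(edge_counts):
    """Render weighted call edges as a mermaid flowchart description."""
    lines = ["graph TD"]
    for (caller, callee), n in edge_counts.items():
        callee_id = callee.replace(".", "_")  # mermaid node ids can't contain dots
        lines.append(f'    {caller} -- {n} --> {callee_id}["{callee}"]')
    return "\n".join(lines)

# print(to_mermaid(edge_counts))  # commented out -- see the note above
```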
In the next installment we’ll go into the details about how to actually create this parser with a final repo for creating a function dependency graph.
There are many complex cases that need to be handled to perform a thorough analysis. For instance, when the imports are not all tidily placed at the top, and also:

- several names imported at once: `from x import (a, b, c)`
- star imports: `from module import *`
- aliased imports: `import numpy as np` means we have to track `np` in the code
- inline imports inside function bodies, relative imports, and submodule imports

Here we have examples of all the above (presumably `example/worker_more_difficult.py`):
```python
from helper_1 import greeting
from helper_3 import (a, b, c, d)
from collections import *

def do_stuff():
    import re
    import talk as tk
    print(f"{greeting()}, {name()}!")

from helper_2 import name, crab
from ..test import say_hello
import sub.one
import sub.subsub.two
from sub.one import cat
from sub.subsub.two import dog
```
We catch the imports outside of function definitions as before:
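Roughly (filename assumed):

```python
tree = ast.parse(open("example/worker_more_difficult.py").read())
tree.body
```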
```
[<_ast.ImportFrom at 0x7ffb1452b670>,
 <_ast.ImportFrom at 0x7ffb1452bf40>,
 <_ast.ImportFrom at 0x7ffb1452ba00>,
 <_ast.FunctionDef at 0x7ffb144b5040>,
 <_ast.ImportFrom at 0x7ffb144b5520>,
 <_ast.ImportFrom at 0x7ffb144b55e0>,
 <_ast.Import at 0x7ffb144b5670>,
 <_ast.Import at 0x7ffb144b5730>,
 <_ast.ImportFrom at 0x7ffb144b57f0>,
 <_ast.ImportFrom at 0x7ffb144b5880>]
```
When we have more than one function being imported:
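Something like:

```python
imp = tree.body[1]            # from helper_3 import (a, b, c, d)
[a.name for a in imp.names]   # ['a', 'b', 'c', 'd']
```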
Importing everything from a module is going to be a problem for us to solve. We have no real way of logging the functions we’re bringing in without parsing the entire module:
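The star simply shows up as the imported name:

```python
star = tree.body[2]                        # from collections import *
star.module, [a.name for a in star.names]  # ('collections', ['*'])
```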
We can use the `.level` attribute to determine that something was a relative import (`from ..test import say_hello`). From the docs:

> `level` is an integer holding the level of the relative import (0 means absolute import)
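For example:

```python
rel = tree.body[5]      # from ..test import say_hello
rel.module, rel.level   # ('test', 2)
tree.body[0].level      # 0 -- an absolute import
```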
There is actually a distinct `Import` node type for module imports with the simpler `import x` syntax. The unfortunate part is that this node does not have the `level` attribute, so we will have to manually parse the module names to determine when things were subdirectory imports:
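For instance:

```python
imp = tree.body[6]            # import sub.one
[a.name for a in imp.names]   # ['sub.one'] -- the dotted path is just part of the name
```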
Our function definition includes two inline import statements. As with the module node, we can see these imports in the body of the function definition node:
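```python
func = tree.body[3]   # the do_stuff definition
func.body
```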
```
[<_ast.Import at 0x7ffb144b50d0>,
 <_ast.Import at 0x7ffb144b5160>,
 <_ast.Expr at 0x7ffb144b5280>]
```
Next we need to better understand how to unpack the information in a `Call` node. Here again is `example/abyss.py`:
```python
import numpy as np

z = np.zeroes(5)

def mul():
    a = np.array([[1, 0],
                  [0, 1]])
    b = np.array([[4, 1],
                  [2, 2]])
    return np.matmul(a, b)

def eigs_of_product():
    a = np.array([[1, 0],
                  [0, 1]])
    b = np.array([[4, 1],
                  [2, 2]])
    product = np.matmul(a, b)
    eigs = np.linalg.eigs(product)
    np.linalg.debug.depth.error_print(eigs)  # this is a fake call, comment out if you want to run this
    return eigs
```
We build a counter of node types:
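```python
from collections import Counter

nodes = walk_script("example/abyss.py")
Counter(type(n) for n in nodes)
```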
From this (comparing how many times we used `np` in the script to these counts), we can tell that the information about the numpy calls is found in nodes of type `_ast.Call`:
```
Counter({_ast.Module: 1,
         _ast.Import: 1,
         _ast.Assign: 7,
         _ast.FunctionDef: 2,
         _ast.alias: 1,
         _ast.Name: 23,
         _ast.Call: 9,
         _ast.arguments: 2,
         _ast.Return: 2,
         _ast.Expr: 1,
         _ast.Store: 7,
         _ast.Attribute: 13,
         _ast.Constant: 17,
         _ast.Load: 41,
         _ast.List: 12})
```
So, we grab just the calls like this:
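With a type check:

```python
calls = [n for n in nodes if isinstance(n, ast.Call)]
calls
```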
```
[<_ast.Call at 0x7ffb144f1b20>,
 <_ast.Call at 0x7ffb144f1be0>,
 <_ast.Call at 0x7ffb144f1490>,
 <_ast.Call at 0x7ffb144f10a0>,
 <_ast.Call at 0x7ffb1459cfd0>,
 <_ast.Call at 0x7ffb1454fd30>,
 <_ast.Call at 0x7ffb1454f910>,
 <_ast.Call at 0x7ffb1458a760>,
 <_ast.Call at 0x7ffb1458a580>]
```
This is the basic structure of the call data:
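A sketch, using the first call (`np.zeroes(5)`):

```python
call = calls[0]      # np.zeroes(5)
call.lineno          # 3
call.args            # [<_ast.Constant ...>] -- the argument 5
call.func.attr       # 'zeroes'
call.func.value.id   # 'np'
```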
Nested calls (using a submodule, like when we use `np.linalg.eigs`) require different handling, as do deeply nested calls (`np.linalg.debug.depth.error_print(eigs)`): you just keep grabbing `.value` until the type changes from `ast.Attribute` to `ast.Name`:
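This is the same walk the recursive `get_submodule_desc` above performs, written as a loop:

```python
call = calls[-1]   # np.linalg.debug.depth.error_print(eigs)
node = call.func
parts = []
while isinstance(node, ast.Attribute):
    parts.append(node.attr)   # 'error_print', 'depth', 'debug', 'linalg'
    node = node.value
parts.append(node.id)         # the ast.Name at the bottom: 'np'
parts[::-1]                   # ['np', 'linalg', 'debug', 'depth', 'error_print']
```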
In some of these examples we can begin to see an issue we’ll need to resolve later. Really, rather than using the walked list of nodes in the AST, we want to recursively walk from the top node and use recursive calls to extract data from the nodes. More on that in the next article.