Intro
In the previous installment, we explored `ast` and gave high-level details about how to use it to generate a graph description of a Python script. Here we present a full implementation of this process: pygraphcrawler.
Examples
Before we get into how it works, let’s see it in action. To parse code in a directory and create a Mermaid description:
```python
from code_extraction import extract_code_information
from viz_code import create_graph_description

# m_info will house the code data
m_info = extract_code_information(
    directories=["example", "example2"],  # each directory will be searched
    other_python_filenames=["test.py"]    # or specify particular files
)
print(m_info.keys())  # to show us which modules we parsed

module_to_inspect = "abyss"  # select one of the module names
# this is a markdown description of the selected module
mermaid_graph_desc = create_graph_description(m_info[module_to_inspect])
```
Multiple calls into submodules of an import
```python
import numpy as np

z = np.zeros(5)

def mul():
    a = np.array([[1, 0],
                  [0, 1]])
    b = np.array([[4, 1],
                  [2, 2]])
    return np.matmul(a, b)

def eigs_of_product():
    a = np.array([[1, 0],
                  [0, 1]])
    b = np.array([[4, 1],
                  [2, 2]])
    product = np.matmul(a, b)
    eigs = np.linalg.eigs(product)
    # this is a fake call for testing
    np.linalg.debug.depth.error_print(eigs)
    return eigs
```
- external module calls are grouped
- repeated calls to a function are denoted with edge labels
Calls into different modules
```python
import numpy as np
import re

pop_size = 10
low = 0
high = 100

def random_draw():
    return np.random.randint(low, high, pop_size)

def get_random_sample_mean():
    pop = [random_draw() for _ in range(10)]
    return np.mean(pop)

IMPORTANT_PATTERN = r"\d\d\s\w*\s"

def find(text):
    return re.findall(IMPORTANT_PATTERN, text)
```
- at a glance we can see connections and distinct domains of defined functions
- `random_draw` is a helper
- `find` has nothing to do with `get_random_sample_mean`
Handling Classes
We have functions and classes that call other classes to use their functionality. We keep track of these dependencies.
```python
import numpy as np

things = []

def check():
    things.append(5)
    np.zeros(0)

class Dog:
    def __init__(self):
        self.friend = Catdog()
        self.shared_total = self.friend.run()

test = Catdog()
test.run()

class Catdog:
    def __init__(self, how):
        self.how = how

    def run(self):
        test = np.zeros(3)
        total = test.sum()
        return total

c = Catdog(3)
t = c.run()

def create_catdog_run():
    c = Catdog(3)
    t = c.run()
    print(t)
```
- `Dog` is entirely dependent on `Catdog`. Good to know if you're thinking about changing `Catdog`.
- `numpy` is an important dependency; everything is connected to it
Dogfooding
Now let’s apply the crawler to itself!
- you can nicely see the recursive nature of the process, where `get_submodule_desc` points to itself
- the class subgraphs for our data containers like `ImportNode` and `ClassNode` are empty. This is because we're using `@dataclass` for simplicity. A potential improvement would be parsing them knowing that they have `__init__` methods coming from the decorator
- we can also see some dead code: `process_class_func_node` has nothing depending on it and we know we aren't calling it
Collections
Finally, let's look at one of the standard library modules. Given we use `collections` when building this crawler, that seems like a fun choice.
On my machine (I'm using WSL 2), the library is stored at `/usr/lib/python3.8/collections`. So, I extract the module by pointing to that directory, which has two files: `__init__.py` and `abc.py`. The latter is pretty sparse, because it's just performing imports.
If we were to get the entire graph for `collections`, it would actually be pretty horrendous: the module has many classes, and each class performs a ton of calls. To give you the power to select a subgraph of the code information, there are options for `create_graph_description`:
- `wanted_classes` - only display information about the classes whose names are in this list
- `include_body_commands` - include the calls that are made in the body of the script
- `include_function_defs` - include details about the module's non-class function definitions
First, we extract the information to create the module info:
= "/usr/lib/python3.8/collections"
collections_location = extract_code_information(directories=[collections_location]) m_info
Then with `[c.name for c in m_info["__init__"]["class_list"]]` we can see these are the classes in `__init__.py`:
```python
['_OrderedDictKeysView',
 '_OrderedDictItemsView',
 '_OrderedDictValuesView',
 '_Link',
 'OrderedDict',
 'Counter',
 'ChainMap',
 'UserDict',
 'UserList',
 'UserString']
```
If we only want information about `OrderedDict`, then we use `create_graph_description` with these arguments to mute body calls and other function definitions:
```python
create_graph_description(
    m_info["__init__"],
    wanted_classes=["OrderedDict"],
    include_body_commands=False,
    include_function_defs=False
)
```
- despite being a more complex example, the call depth is actually pretty shallow
- many operations are defined in terms of operations on simpler data structures, in this case operations on regular dictionaries
- we have some external module data here, from `warnings`, `_sys`, and `_collections_abc`. These aren't used by `OrderedDict`, but are handled differently, so you would have to do a little more work to mute them
Let's look at another class (`Counter`) but show any non-class functions and work done in the script. We do this using the fact that the defaults are `True` for showing that data:
"__init__"], wanted_classes=["Counter"]) create_graph_description(m_info[
We can see that there is a lot going on, but also a lot of common boilerplate. When you are inspecting more complicated modules like this, you'll want to use the options provided to break the graph into more readable chunks.
pygraphcrawler - Code Graph Creation
The final repo is here. The main parts are:
- `code_extraction.py` - entry point script to coordinate which files to parse and run the parser
- `dep_parser.py` - generate module info by parsing a module using `ast`
- `code_graph.py` - convert module info into graph edge data
- `viz_code.py` - use edge data and module info to create Mermaid graph descriptions
Crawl Code
We start by parsing the full script and looking at the top-level module node. This lets us know the name of the script and gives us its immediate children in the `ast` graph.
```python
module_node = get_top_level_node_from_filename(path)

import_list, call_list, func_defs, class_list = parse_module_node(
    module_node, current_module_name, verbose=verbose
)
```
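Here, `get_top_level_node_from_filename` is a small helper; a minimal sketch of the idea, assuming it simply reads the file and parses it with `ast` (the repo's version may differ):

```python
import ast

def get_top_level_node_from_filename(path):
    # read the source and parse it into the top-level ast.Module node
    with open(path) as f:
        return ast.parse(f.read())
```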
This work is done in `parse_module_node`, where we go through each node in the body of the `module_node` and, depending on the node type, process it and its child `ast` nodes. This lets us distinguish calls made inside function definitions from calls made in the general script, and annotate those dependencies. By walking these parsed elements, we can extract the data stored in an `ast` node and convert it to more convenient classes. For instance, rather than working with `ast.Call` objects, we use a custom `CallNode` object:
```python
from dataclasses import dataclass

@dataclass
class CallNode:
    module: str            # what module does the called function belong to
    name: str              # function name
    call_lineno: int       # where was the call
    called_by: str = None  # what was the caller
```
Then, provided an `ast.Call` node, we parse it for the data above like this:
```python
func_data = node.func  # func is an attribute of an ast.Call object

# func_data can either be an Attribute or a Name, which require
# different ways to access their properties
if isinstance(func_data, ast.Attribute):
    function_name = func_data.attr
    value = func_data.value
    # handle submodule calls like np.linalg.norm
    submodule_desc = get_submodule_desc(value)
    submodule_desc.reverse()
    call_node = CallNode(
        module=submodule_desc, name=function_name, call_lineno=node.lineno
    )
elif isinstance(func_data, ast.Name):
    call_node = CallNode(
        name=func_data.id,
        module=[],  # the call node did not have this detail, requiring separate handling
        call_lineno=node.lineno,
    )
else:
    print("error")
    print(node)
    return None
```
By parsing it, we have an easier way to reference the call and its properties. We have similar classes and parsers for the other node types.
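As an aside, `get_submodule_desc` can be pictured as unwinding nested `ast.Attribute` nodes. Here is a hedged sketch of that idea, not necessarily the repo's implementation; for `np.linalg.norm(x)`, the call's `func.value` is the `np.linalg` attribute chain:

```python
import ast

def get_submodule_desc(value):
    # unwind nested Attribute nodes: for np.linalg.norm this yields
    # ['linalg', 'np'], which the caller reverses to ['np', 'linalg']
    desc = []
    while isinstance(value, ast.Attribute):
        desc.append(value.attr)
        value = value.value
    if isinstance(value, ast.Name):
        desc.append(value.id)
    return desc
```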
Once we have the module data, we collect it into a dictionary (sketched below) specifying:
- `import_list` - which modules were imported and where we call them
- `call_list` - all function calls
- `func_defs` - function definitions
- `class_list` - class data, including methods
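Concretely, the per-module dictionary looks something like this, using the values returned by `parse_module_node` (the exact shape in the repo may differ slightly):

```python
module_info = {
    "import_list": import_list,  # ImportNode objects
    "call_list": call_list,      # CallNode objects
    "func_defs": func_defs,      # function definition data
    "class_list": class_list,    # ClassNode objects, including methods
}
```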
From here, we can take this parsed module data and convert it to edges. This is where we can prune some of the lower-level function calls (like `print()`) that we are not interested in visualizing. We can also compress multiple calls from one function to another into a single edge whose weight tracks the call count (as in the first example); a rough sketch follows.
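The names here are illustrative, not the repo's API, but the pruning and compression step amounts to something like:

```python
from collections import Counter

SKIP_CALLS = {"print", "len", "range"}  # low-level calls we don't visualize

# collapse repeated (caller, callee) pairs into a single weighted edge
edge_weights = Counter(
    (call.called_by, call.name)
    for call in call_list
    if call.name not in SKIP_CALLS
)
```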
Finally, we take the edges and module data and create the Mermaid graph description:
- edge data allows us to specify the graph edges and carefully check naming conventions
- module data lets us create subgraphs to group calls into the same external module. Grouping calls in this way makes dependencies a lot easier to understand.
- module data also enables grouping class methods
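To make that concrete, here is a deliberately simplified sketch of turning the weighted edges into a Mermaid description; the real `create_graph_description` also builds the subgraphs and sanitizes names:

```python
def edges_to_mermaid(edge_weights):
    # emit a left-to-right flowchart; repeated calls get an edge label
    # (body calls, where called_by is None, would need their own handling)
    lines = ["graph LR"]
    for (caller, callee), count in edge_weights.items():
        arrow = f" -->|{count}| " if count > 1 else " --> "
        lines.append(f"    {caller}{arrow}{callee}")
    return "\n".join(lines)
```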
Uses
Code Analysis
As we saw in the examples, the code graph can tell us some basic things at a glance:
- what are the top level functions - look at the nodes farthest to the left
- what external libraries do we have the most dependencies on - look at call counts into the module subgraphs and terminal nodes with many ancestors
- do we have dead code - is there a component of the graph that contains a top-level function we no longer use? (a minimal check is sketched below)
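That dead-code check is easy to phrase with the parsed data; a minimal version reusing the names from the earlier sketches, assuming each function definition exposes a `name` the way `CallNode` does:

```python
defined = {f.name for f in func_defs}            # functions we define
called = {callee for _, callee in edge_weights}  # functions something calls
dead_candidates = defined - called               # defined but never called
```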
Documentation
For smaller modules, this is also a form of documentation. You can get a pretty good idea of what a function is doing by looking at its name and what it calls. For larger modules this still applies, but anything with many definitions can become cluttered, and you may want to break it into pieces.
Extensions
Code Similarity
Given two code graphs, how can we measure similarity?
- use basic graph similarity measures from a graph object constructed from the edge data
- use the calls into external modules and see how much they overlap (sketched after this list)
- use the markdown graph as a compressed version of the two modules and run text similarity methods, which can be difficult to apply to the uncompressed original source
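The external-call overlap idea is simple to prototype; a sketch using Jaccard similarity over the `CallNode` data:

```python
def external_call_overlap(calls_a, calls_b):
    # Jaccard similarity of the (module, function) pairs each module
    # calls into; 1.0 means identical external dependencies
    a = {(tuple(c.module), c.name) for c in calls_a}
    b = {(tuple(c.module), c.name) for c in calls_b}
    return len(a & b) / len(a | b) if a | b else 0.0
```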
By identifying when we have similar code structures, we can begin to develop better abstractions of code (as needed). Right now, this only generates the graph data, but this would be a possible extension of that data.
Code Generation
Using the basic building blocks of an answer to a small problem, can we randomly recreate it?
This is more of a fringe use case, but you could use the inter-function dependency data to shrink the space of grammatical code expressions in order to randomly generate code. Not quite a copilot by any stretch, but it could produce some interesting results.
Lessons Learned
When to Walk, When to Run
Initially, I was parsing code using the full list of nodes generated by `ast.walk()`. Unfortunately, this makes it difficult to connect calls to the function definitions they reside in: in the walked list, a function definition node may be followed by call nodes, but it takes extra work to determine whether those calls were inside the function definition or merely appeared later in the walked tree. You could get around this with line number comparisons (I leave some deprecated functions for doing this in the repo), but I found it simpler to recursively use the node structure generated by `ast`.
On the other hand, you sometimes do want to sweep through all the child nodes. When you manually crawl the tree, you can end up with heavily nested structures to traverse. For instance, `f() + (2 * (g() + h()))` involves going through the expression and parsing binary operations and nested expressions just to get the data you care about (the three functions that were called).
When you crawl the node structure, selectively using `ast.walk` on some of the child nodes of the top-level module node can avoid this. Get the highest-level nodes, like classes and function definitions, and then let `ast` quickly walk all the children of those nodes so you can gather all the data you care about and tie it to those higher-level nodes.
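A small example of the difference; `ast.walk` flattens everything, while crawling the top-level `body` first keeps ownership of the calls explicit:

```python
import ast

src = """
def f():
    g()

h()
"""
tree = ast.parse(src)

# flat walk: both calls land in one list, with no record of which
# one lives inside f
print([n.func.id for n in ast.walk(tree) if isinstance(n, ast.Call)])

# crawl the top-level body first, then walk each high-level node
for node in tree.body:
    if isinstance(node, ast.FunctionDef):
        inner = [n.func.id for n in ast.walk(node) if isinstance(n, ast.Call)]
        print(node.name, "contains", inner)  # f contains ['g']
```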
Keywords and Meta-programming
In my graph description generator, I'm careful to rename words like "map" and "find" when they are used outside of node labels. These appear to be keywords of the process that generates the graph image from the markdown, so including them as regular node names will cause the preview to error out in VSCode. There was also a special case for the word "end" appearing as part of a node name, because this conflicts with using "end" to delineate subgraphs; the fix is to capitalize part of "end". This is why it is important to sanitize input before compiling the final product.
This is a general concern to keep in mind: when you're creating descriptions of code, you want to be careful about escaping problematic terms so that whatever system you are using does not confuse a command with a description. I found similar issues when printing the Mermaid graphs in a notebook, which sometimes caused the entire notebook to hit invalid states. It's hard to give good guidance other than "be on the lookout."
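In spirit, the sanitization looks like this; the reserved-word list and renaming scheme below are illustrative, not an exhaustive account of Mermaid's keywords:

```python
RESERVED = {"map", "find"}  # words that broke rendering for me

def sanitize_node_name(name):
    if name in RESERVED:
        return name + "_"  # rename so Mermaid doesn't treat it as a keyword
    # "end" delineates subgraphs, so capitalize part of it
    return name.replace("end", "End") if "end" in name else name
```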
Low-level Calls
In common expressions like `some_list.append(item)`, you are performing a function call. While it may be useful to see this in the final graph, I found that it cluttered the visualization. To get around this, I kept such calls in the parsed data but used explicit lists of common function names to exclude when creating the graph markdown. Along with some rules for how to handle parsing a call's module, this removed a lot of noise.
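For example, a couple of illustrative exclusion rules (the repo's actual lists and module-handling rules will differ):

```python
# common container/builtin method names that add noise rather than signal
EXCLUDED_CALL_NAMES = {"append", "extend", "pop", "items", "keys", "print"}

def keep_call(call, known_imports=()):
    # keep calls rooted in a module we imported (e.g. np.append), but
    # drop bare container-method noise like some_list.append
    if call.module and call.module[0] in known_imports:
        return True
    return call.name not in EXCLUDED_CALL_NAMES
```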
This is an opportunity for improvement though, because I'm using simple rules to decide when to skip over calls. Ideally, something that determines the type of object a function is attached to (are we running an operation on a dictionary or a list?) would improve this.
Classes
Related to the difficulty of detecting type information, things begin to get tricky when class definitions are involved. In fact, I thought I was done with the project before I saw how large a gap effectively processing classes left.
Getting the function definition information out of class methods is straightforward, because things are nicely grouped in an `ast` node for the class. Less obvious is how to determine when you are making a call to an instantiated object of a class.
My solution was a hack: when you have a call, check if the name overlaps with any of the names of classes in the module. If so, it's likely creating an object, so jump up the walked syntax tree to the assignment and keep track of the name used for the assignment target (the `x` in `x = 1`). Then, in that scope, you know any calls involving that object name are calls to the class methods, and you can track them appropriately.
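Sketching the hack (simplified; the repo tracks more state than this):

```python
import ast

def track_instances(module_node, class_names):
    # map assignment targets to class names when the right-hand side
    # instantiates a known class: c = Catdog(3) records {'c': 'Catdog'}
    instance_of = {}
    for node in ast.walk(module_node):
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Call):
            func = node.value.func
            if isinstance(func, ast.Name) and func.id in class_names:
                for target in node.targets:
                    if isinstance(target, ast.Name):
                        instance_of[target.id] = func.id
    return instance_of
```

Any later call like `c.run()` whose base name appears in `instance_of` can then be attributed to `Catdog.run`.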
To do this “the right way” you would have to almost run your own version of the interpreter, as mentioned in the low-level calls lesson. This would definitely be an interesting problem to work on, but I’m satisfied omitting the ability to parse more elaborate Python code in this visualizer.
Have Fun
Feel free to experiment with the code and let me know if you find other use cases or if you’d like to contribute.