graph LR;
mul[mul] -->|2| np.array[np.array];
mul[mul] --> np.matmul[np.matmul];
eigs_of_product[eigs_of_product] -->|2| np.array[np.array];
eigs_of_product[eigs_of_product] --> np.matmul[np.matmul];
eigs_of_product[eigs_of_product] --> np.linalg.eig[np.linalg.eig];
eigs_of_product[eigs_of_product] --> np.linalg.debug.dept[np.linalg.debug.depth.error_print];
main[main] --> np.zeros[np.zeros];
subgraph np
np.linalg.debug.dept
np.array
np.linalg.eig
np.matmul
np.zeros
end
Intro
In the previous installment, we explored ast and gave a high-level overview of how to use it to generate a graph description of a Python script. Here we present a full implementation of that process: pygraphcrawler.
Examples
Before we get into how it works, let’s see it in action. To parse code in a directory and create a Mermaid description:
from code_extraction import extract_code_information
from viz_code import create_graph_description
# m_info will house the code data
m_info = extract_code_information(
    directories=["example", "example2"],  # each directory will be searched
    other_python_filenames=["test.py"],  # or specify particular files
)
print(m_info.keys())  # to show us which modules we parsed
module_to_inspect = "abyss"  # select one of the module names
# this is a markdown description of the selected module
mermaid_graph_desc = create_graph_description(m_info[module_to_inspect])
Multiple calls into submodules of an import
import numpy as np
z = np.zeros(5)
def mul():
    a = np.array([[1, 0],
                  [0, 1]])
    b = np.array([[4, 1],
                  [2, 2]])
    return np.matmul(a, b)
def eigs_of_product():
    a = np.array([[1, 0],
                  [0, 1]])
    b = np.array([[4, 1],
                  [2, 2]])
    product = np.matmul(a, b)
    eigs = np.linalg.eig(product)
    np.linalg.debug.depth.error_print(eigs)  # this is a fake call for testing
    return eigs
- external module calls are grouped
- repeated calls to a function are denoted with edge labels
Calls into different modules
import numpy as np
import re
pop_size = 10
low = 0
high = 100
def random_draw():
    return np.random.randint(low, high, pop_size)
def get_random_sample_mean():
    pop = [random_draw() for _ in range(10)]
    return np.mean(pop)
IMPORTANT_PATTERN = r"\d\d\s\w*\s"
def find(text):
    return re.findall(IMPORTANT_PATTERN, text)
graph LR;
random_draw[random_draw] --> np.random.randint[np.random.randint];
get_random_sample_me[get_random_sample_mean] --> random_draw[random_draw];
get_random_sample_me[get_random_sample_mean] --> np.mean[np.mean];
py.find[find] --> re.findall[re.findall];
subgraph np
np.mean
np.random.randint
end
subgraph re
re.findall
end
- at a glance we can see connections and distinct domains of defined functions
- `random_draw` is a helper
- `find` has nothing to do with `get_random_sample_mean`
Handling Classes
Functions and classes can call other classes to use their functionality. We keep track of these dependencies as well.
import numpy as np
things = []
def check():
    things.append(0)
    np.zeros(5)
class Dog:
    def __init__(self):
        self.friend = Catdog(3)
        self.shared_total = self.friend.run()
        test = Catdog(3)
        test.run()
class Catdog:
    def __init__(self, how):
        self.how = how
    def run(self):
        test = np.zeros(3)
        total = test.sum()
        return total
c = Catdog(3)
t = c.run()
def create_catdog_run():
    c = Catdog(3)
    t = c.run()
    print(t)
graph LR;
check[check] --> np.zeros[np.zeros];
create_catdog_run[create_catdog_run] --> Catdog.__init__[Catdog.__init__];
create_catdog_run[create_catdog_run] --> Catdog.run[Catdog.run];
Dog.__init__[Dog.__init__] -->|2| Catdog.__init__[Catdog.__init__];
Dog.__init__[Dog.__init__] -->|2| Catdog.run[Catdog.run];
Catdog.run[Catdog.run] --> np.zeros[np.zeros];
main[main] --> Catdog.__init__[Catdog.__init__];
main[main] --> Catdog.run[Catdog.run];
subgraph Dog
Dog.__init__
end
subgraph Catdog
Catdog.__init__
Catdog.run
end
subgraph np
np.zeros
end
- `Dog` depends entirely on `Catdog`. Good to know if you're thinking about changing `Catdog`.
- `numpy` is an important dependency; everything is connected to it.
Dogfooding
Now let’s apply the crawler to itself!
graph LR;
walk_script[walk_script] --> ast.parse[ast.parse];
walk_script[walk_script] --> worker_script.read[worker_script.read];
walk_script[walk_script] --> ast.walk[ast.walk];
get_submodule_desc[get_submodule_desc] --> get_submodule_desc[get_submodule_desc];
process_from_import_[process_from_import_node] --> ImportNode.__init__[ImportNode.__init__];
process_import_node[process_import_node] --> ImportNode.__init__[ImportNode.__init__];
process_call_node[process_call_node] --> print_call_node_info[print_call_node_info];
process_call_node[process_call_node] --> get_submodule_desc[get_submodule_desc];
process_call_node[process_call_node] --> CallNode.__init__[CallNode.__init__];
process_call_node[process_call_node] --> CallNode.__init__[CallNode.__init__];
process_func_def_nod[process_func_def_node] --> FuncDefNode.__init__[FuncDefNode.__init__];
process_class_node[process_class_node] --> ClassNode.__init__[ClassNode.__init__];
process_class_func_n[process_class_func_node] --> FuncDefNode.__init__[FuncDefNode.__init__];
add_import[add_import] --> process_import_node[process_import_node];
add_import[add_import] --> process_from_import_[process_from_import_node];
add_call_or_import[add_call_or_import] --> process_import_node[process_import_node];
add_call_or_import[add_call_or_import] --> process_from_import_[process_from_import_node];
add_call_or_import[add_call_or_import] --> process_call_node[process_call_node];
walk_node_children[walk_node_children] --> ast.walk[ast.walk];
process_class_functi[process_class_function_def] --> process_func_def_nod[process_func_def_node];
process_class_functi[process_class_function_def] --> process_func_def_chi[process_func_def_children];
process_class_method[process_class_methods] --> process_class_functi[process_class_function_def];
get_instantiated_obj[get_instantiated_object_name] --> sibling_list.index[sibling_list.index];
update_call_data_for[update_call_data_for_object_info] --> get_instantiated_obj[get_instantiated_object_name];
process_func_def_chi[process_func_def_children] --> walk_node_children[walk_node_children];
process_func_def_chi[process_func_def_children] --> process_func_def_nod[process_func_def_node];
process_func_def_chi[process_func_def_children] --> add_import[add_import];
process_func_def_chi[process_func_def_children] --> process_call_node[process_call_node];
process_func_def_chi[process_func_def_children] --> update_call_data_for[update_call_data_for_object_info];
process_script_work[process_script_work] --> walk_node_children[walk_node_children];
process_script_work[process_script_work] --> process_call_node[process_call_node];
process_script_work[process_script_work] --> update_call_data_for[update_call_data_for_object_info];
parse_module_node[parse_module_node] --> process_class_method[process_class_methods];
parse_module_node[parse_module_node] --> process_class_node[process_class_node];
parse_module_node[parse_module_node] --> process_func_def_nod[process_func_def_node];
parse_module_node[parse_module_node] --> process_func_def_chi[process_func_def_children];
parse_module_node[parse_module_node] --> add_import[add_import];
parse_module_node[parse_module_node] --> process_script_work[process_script_work];
manage_module_import[manage_module_imports] --> collections.defaultd[collections.defaultdict];
manage_module_import[manage_module_imports] --> ImportNode.__init__[ImportNode.__init__];
get_walked_scripted_[get_walked_scripted_from_filename] --> pathlib.Path[pathlib.Path];
get_walked_scripted_[get_walked_scripted_from_filename] --> walk_script[walk_script];
get_top_level_node_f[get_top_level_node_from_filename] --> ast.parse[ast.parse];
get_top_level_node_f[get_top_level_node_from_filename] --> script_contents.read[script_contents.read];
extract_node_structu[extract_node_structure_from_script] --> pathlib.Path[pathlib.Path];
extract_node_structu[extract_node_structure_from_script] --> get_top_level_node_f[get_top_level_node_from_filename];
extract_node_structu[extract_node_structure_from_script] --> parse_module_node[parse_module_node];
extract_node_structu[extract_node_structure_from_script] --> manage_module_import[manage_module_imports];
subgraph ImportNode
end
subgraph CallNode
end
subgraph FuncDefNode
end
subgraph ClassNode
end
subgraph pathlib
pathlib.Path
end
subgraph collections
collections.defaultd
end
subgraph ast
ast.parse
ast.walk
end
- you can nicely see the recursive nature of the process, where `get_submodule_desc` is pointing to itself
- the class subgraphs for our data containers like `ImportNode` and `ClassNode` are empty. This is because we're using `@dataclass` for simplicity. A potential improvement would be parsing them knowing that they have `__init__` methods coming from the decorator
- we can also see some dead code, where `process_class_func_node` has nothing depending on it and we know we aren't calling it
Collections
Finally, let's look at one of the standard library modules. Given that we use collections when building this crawler, that seems like a fun choice.
On my machine (I'm using WSL 2), the library is stored at /usr/lib/python3.8/collections. So, I extract the module by pointing to that directory, which contains two files, __init__.py and abc.py. The latter is pretty sparse because it just performs imports.
If we were to get the entire graph for collections, it would actually be pretty horrendous: it has many classes, and each class performs a ton of calls. To give you the power to select a subgraph of the code information, there are options for create_graph_description:
- `wanted_classes` - only display information about the classes whose names are in this list
- `include_body_commands` - include the calls that are made in the body of the script
- `include_function_defs` - include details about the module's non-class function definitions
First, we extract the information to create the module info:
collections_location = "/usr/lib/python3.8/collections"
m_info = extract_code_information(directories=[collections_location])
Then with `[c.name for c in m_info["__init__"]["class_list"]]` we can see these are the classes in __init__.py:
['_OrderedDictKeysView',
'_OrderedDictItemsView',
'_OrderedDictValuesView',
'_Link',
'OrderedDict',
'Counter',
'ChainMap',
'UserDict',
'UserList',
'UserString']
If we only want information about OrderedDict, then we use create_graph_description with these arguments to mute body calls and other function definitions:
create_graph_description(
    m_info["__init__"],
    wanted_classes=["OrderedDict"],
    include_body_commands=False,
    include_function_defs=False,
)
graph LR;
OrderedDict.__init__[OrderedDict.__init__] --> _Link.__init__[_Link.__init__];
OrderedDict.__init__[OrderedDict.__init__] --> _proxy[_proxy];
OrderedDict.__init__[OrderedDict.__init__] --> self.__update[self.__update];
OrderedDict.__setite[OrderedDict.__setitem__] --> Link[Link];
OrderedDict.__setite[OrderedDict.__setitem__] --> proxy[proxy];
OrderedDict.__setite[OrderedDict.__setitem__] --> dict_setitem[dict_setitem];
OrderedDict.__delite[OrderedDict.__delitem__] --> dict_delitem[dict_delitem];
OrderedDict.__delite[OrderedDict.__delitem__] --> self.__map.pop[self.__map.pop];
OrderedDict.clear[OrderedDict.clear] --> self.__map.clear[self.__map.clear];
OrderedDict.clear[OrderedDict.clear] --> dict.clear[dict.clear];
OrderedDict.popitem[OrderedDict.popitem] --> dict.pop[dict.pop];
OrderedDict.__sizeof[OrderedDict.__sizeof__] --> sizeof[sizeof];
OrderedDict.__sizeof[OrderedDict.__sizeof__] --> sizeof[sizeof];
OrderedDict.__sizeof[OrderedDict.__sizeof__] --> sizeof[sizeof];
OrderedDict.__sizeof[OrderedDict.__sizeof__] --> sizeof[sizeof];
OrderedDict.keys[OrderedDict.keys] --> _OrderedDictKeysView[_OrderedDictKeysView.__init__];
OrderedDict.items[OrderedDict.items] --> _OrderedDictItemsVie[_OrderedDictItemsView.__init__];
OrderedDict.values[OrderedDict.values] --> _OrderedDictValuesVi[_OrderedDictValuesView.__init__];
OrderedDict.__reduce[OrderedDict.__reduce__] --> copy[copy];
OrderedDict.__reduce[OrderedDict.__reduce__] --> vars[vars];
OrderedDict.__reduce[OrderedDict.__reduce__] --> vars[vars];
OrderedDict.__reduce[OrderedDict.__reduce__] --> OrderedDict.__init__[OrderedDict.__init__];
OrderedDict.__reduce[OrderedDict.__reduce__] --> inst_dict.pop[inst_dict.pop];
OrderedDict.copy[OrderedDict.copy] --> self.__class__[self.__class__];
OrderedDict.__eq__[OrderedDict.__eq__] --> dict.__eq__[dict.__eq__];
OrderedDict.__eq__[OrderedDict.__eq__] --> dict.__eq__[dict.__eq__];
subgraph OrderedDict
OrderedDict.__init__
OrderedDict.__setite
OrderedDict.__delite
OrderedDict.__iter__
OrderedDict.__revers
OrderedDict.clear
OrderedDict.popitem
OrderedDict.move_to_
OrderedDict.__sizeof
OrderedDict.keys
OrderedDict.items
OrderedDict.values
OrderedDict.pop
OrderedDict.setdefau
OrderedDict.__repr__
OrderedDict.__reduce
OrderedDict.copy
OrderedDict.fromkeys
OrderedDict.__eq__
end
subgraph _collections_abc
_collections_abc.Mut
end
subgraph _sys
_sys._getframe
_sys.intern
end
subgraph warnings
warnings.warn
end
- despite being a more complex example, the call depth is actually pretty shallow
- many operations are defined in terms of operations of simpler data structures, in this case operations on regular dictionaries
- we have some external module data here, from `warnings`, `_sys`, and `_collections_abc`. These aren't used by `OrderedDict`, but they are handled differently, so you would have to do a little more work to mute them.
Let's look at another class (Counter), but this time show the non-class functions and the work done in the script body. We do this by relying on the defaults, which are True for showing that data:
create_graph_description(m_info["__init__"], wanted_classes=["Counter"])
graph LR;
__getattr__[__getattr__] --> getattr[getattr];
__getattr__[__getattr__] --> warnings.warn[warnings.warn];
__getattr__[__getattr__] --> globals[globals];
__getattr__[__getattr__] --> AttributeError[AttributeError];
namedtuple[namedtuple] --> field_names.replace[field_names.replace];
namedtuple[namedtuple] --> _sys.intern[_sys.intern];
namedtuple[namedtuple] --> seen.add[seen.add];
namedtuple[namedtuple] --> _iskeyword[_iskeyword];
namedtuple[namedtuple] --> name.startswith[name.startswith];
namedtuple[namedtuple] --> name.isidentifier[name.isidentifier];
namedtuple[namedtuple] --> _iskeyword[_iskeyword];
namedtuple[namedtuple] --> name.isidentifier[name.isidentifier];
namedtuple[namedtuple] --> seen.add[seen.add];
namedtuple[namedtuple] --> name.startswith[name.startswith];
namedtuple[namedtuple] --> exec[exec];
namedtuple[namedtuple] --> tuple_new[tuple_new];
namedtuple[namedtuple] --> _len[_len];
namedtuple[namedtuple] --> self._make[self._make];
namedtuple[namedtuple] --> _map[_map];
namedtuple[namedtuple] --> _dict[_dict];
namedtuple[namedtuple] --> _zip[_zip];
namedtuple[namedtuple] --> _tuple[_tuple];
namedtuple[namedtuple] --> _sys.intern[_sys.intern];
namedtuple[namedtuple] --> _tuplegetter[_tuplegetter];
namedtuple[namedtuple] --> f_globals.get[f_globals.get];
namedtuple[namedtuple] --> _sys._getframe[_sys._getframe];
_count_elements[_count_elements] --> mapping_get[mapping_get];
Counter.__init__[Counter.__init__] --> __init__[__init__];
Counter.__init__[Counter.__init__] --> super[super];
Counter.__init__[Counter.__init__] --> self.update[self.update];
Counter.most_common[Counter.most_common] --> sorted[sorted];
Counter.most_common[Counter.most_common] --> _itemgetter[_itemgetter];
Counter.most_common[Counter.most_common] --> _heapq.nlargest[_heapq.nlargest];
Counter.most_common[Counter.most_common] --> _itemgetter[_itemgetter];
Counter.elements[Counter.elements] --> _chain.from_iterable[_chain.from_iterable];
Counter.elements[Counter.elements] --> _starmap[_starmap];
Counter.fromkeys[Counter.fromkeys] --> NotImplementedError[NotImplementedError];
Counter.update[Counter.update] --> _count_elements[_count_elements];
Counter.update[Counter.update] --> update[update];
Counter.update[Counter.update] --> self_get[self_get];
Counter.update[Counter.update] --> super[super];
Counter.update[Counter.update] --> self.update[self.update];
Counter.subtract[Counter.subtract] --> self_get[self_get];
Counter.subtract[Counter.subtract] --> self_get[self_get];
Counter.subtract[Counter.subtract] --> self.subtract[self.subtract];
Counter.copy[Counter.copy] --> self.__class__[self.__class__];
Counter.__delitem__[Counter.__delitem__] --> __delitem__[__delitem__];
Counter.__delitem__[Counter.__delitem__] --> super[super];
Counter.__repr__[Counter.__repr__] --> format[format];
Counter.__repr__[Counter.__repr__] --> self.most_common[self.most_common];
Counter.__add__[Counter.__add__] --> Counter.__init__[Counter.__init__];
Counter.__sub__[Counter.__sub__] --> Counter.__init__[Counter.__init__];
Counter.__or__[Counter.__or__] --> Counter.__init__[Counter.__init__];
Counter.__and__[Counter.__and__] --> Counter.__init__[Counter.__init__];
Counter.__pos__[Counter.__pos__] --> Counter.__init__[Counter.__init__];
Counter.__neg__[Counter.__neg__] --> Counter.__init__[Counter.__init__];
Counter.__iadd__[Counter.__iadd__] --> self._keep_positive[self._keep_positive];
Counter.__isub__[Counter.__isub__] --> self._keep_positive[self._keep_positive];
Counter.__ior__[Counter.__ior__] --> self._keep_positive[self._keep_positive];
Counter.__iand__[Counter.__iand__] --> self._keep_positive[self._keep_positive];
main[main] --> _collections_abc.Mut[_collections_abc.MutableSequence.register];
main[main] --> property[property];
main[main] --> _itemgetter[_itemgetter];
main[main] --> getattr[getattr];
main[main] --> warnings.warn[warnings.warn];
main[main] --> globals[globals];
main[main] --> AttributeError[AttributeError];
main[main] --> field_names.replace[field_names.replace];
main[main] --> _sys.intern[_sys.intern];
main[main] --> seen.add[seen.add];
main[main] --> _iskeyword[_iskeyword];
main[main] --> name.startswith[name.startswith];
main[main] --> name.isidentifier[name.isidentifier];
main[main] --> _iskeyword[_iskeyword];
main[main] --> name.isidentifier[name.isidentifier];
main[main] --> seen.add[seen.add];
main[main] --> name.startswith[name.startswith];
main[main] --> exec[exec];
main[main] --> tuple_new[tuple_new];
main[main] --> _len[_len];
main[main] --> self._make[self._make];
main[main] --> _map[_map];
main[main] --> _dict[_dict];
main[main] --> _zip[_zip];
main[main] --> _tuple[_tuple];
main[main] --> _sys.intern[_sys.intern];
main[main] --> _tuplegetter[_tuplegetter];
main[main] --> f_globals.get[f_globals.get];
main[main] --> _sys._getframe[_sys._getframe];
main[main] --> mapping_get[mapping_get];
subgraph Counter
Counter.__init__
Counter.__missing__
Counter.most_common
Counter.elements
Counter.fromkeys
Counter.update
Counter.subtract
Counter.copy
Counter.__reduce__
Counter.__delitem__
Counter.__repr__
Counter.__add__
Counter.__sub__
Counter.__or__
Counter.__and__
Counter.__pos__
Counter.__neg__
Counter._keep_positi
Counter.__iadd__
Counter.__isub__
Counter.__ior__
Counter.__iand__
end
subgraph _collections_abc
_collections_abc.Mut
end
subgraph _sys
_sys._getframe
_sys.intern
end
subgraph warnings
warnings.warn
end
We can see that there is a lot going on, but also a lot of common boilerplate. When you are inspecting more complicated modules like this, you'll want to use the options provided to break the graph into more readable chunks.
pygraphcrawler - Code Graph Creation
The final repo is here. The main parts are:
- `code_extraction.py` - entry point script to coordinate which files to parse and run the parser
- `dep_parser.py` - generate module info by parsing a module using `ast`
- `code_graph.py` - convert module info into graph edge data
- `viz_code.py` - use edge data and module info to create Mermaid graph descriptions
Crawl Code
We start by parsing the full script and looking at the top-level module node. This lets us know the name of the script and gives us its immediate children in the ast graph.
module_node = get_top_level_node_from_filename(path)
import_list, call_list, func_defs, class_list = parse_module_node(
    module_node, current_module_name, verbose=verbose
)
This work is done in parse_module_node, where we go through each node in the body of the module_node and, depending on its type, process it and its child ast nodes. This lets us distinguish calls made inside function definitions from calls made in the general script, as well as annotate those dependencies. By walking these parsed elements, we can extract the data stored in an ast node and convert it into more convenient classes. For instance, rather than working with ast.Call objects, we use a custom CallNode object:
from dataclasses import dataclass

@dataclass
class CallNode:
    module: list  # module path (as a list) the called function belongs to
    name: str  # function name
    call_lineno: int  # where was the call
    called_by: str = None  # what was the caller
Then, provided an ast.Call node, we parse it for the data above like this:
func_data = node.func # func is an attribute of an ast.Call object
# func_data can either be an Attribute or a Name, which require
# different ways to access their properties
if isinstance(func_data, ast.Attribute):
    function_name = func_data.attr
    value = func_data.value
    # handle submodule calls like np.linalg.norm
    submodule_desc = get_submodule_desc(value)
    submodule_desc.reverse()
    call_node = CallNode(
        module=submodule_desc, name=function_name, call_lineno=node.lineno
    )
elif isinstance(func_data, ast.Name):
    call_node = CallNode(
        name=func_data.id,
        module=[],  # the call node did not have this detail, requiring separate handling
        call_lineno=node.lineno,
    )
else:
    print("error")
    print(node)
    return None
By parsing it, we have an easier way to reference the call and its properties. We have similar classes and parsers for the other node types.
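The dogfooding graph above shows get_submodule_desc pointing to itself, and the parser above calls it on the value of an ast.Attribute. As a minimal sketch (not necessarily the repo's exact implementation), that recursion could look like this:

```python
import ast

def get_submodule_desc(value):
    # unwind nested attribute access (the np.linalg part of np.linalg.norm),
    # innermost name last; the caller reverses the result
    if isinstance(value, ast.Attribute):
        return [value.attr] + get_submodule_desc(value.value)
    if isinstance(value, ast.Name):
        return [value.id]
    return []

call = ast.parse("np.linalg.norm(x)").body[0].value
desc = get_submodule_desc(call.func.value)
desc.reverse()
print(desc)  # ['np', 'linalg']
```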
Once we have the module data, we collect it into a dictionary specifying:
- `import_list` - which modules were imported and where we called them
- `call_list` - all function calls
- `func_defs` - function definitions
- `class_list` - class data, including methods
From here, we can take this parsed module data and convert it to edges. This is where we can prune some of the lower-level function calls (like print()) that we are not interested in visualizing. We can also compress multiple calls from one function to another into a single edge with a weight tracking the call count (as in the first example).
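A minimal sketch of that compression step, assuming the raw calls arrive as (caller, callee) pairs (the repo's actual data structures may differ):

```python
from collections import Counter

def compress_edges(raw_calls):
    # collapse repeated (caller, callee) pairs into one weighted edge
    counts = Counter(raw_calls)
    return [(caller, callee, n) for (caller, callee), n in counts.items()]

raw = [("mul", "np.array"), ("mul", "np.array"), ("mul", "np.matmul")]
print(compress_edges(raw))
# [('mul', 'np.array', 2), ('mul', 'np.matmul', 1)]
```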
Finally, we take the edges and module data and create the Mermaid graph description:
- edge data allows us to specify the graph edges and carefully check naming conventions
- module data lets us create subgraphs to group calls into the same external module. Grouping calls in this way makes dependencies a lot easier to understand.
- module data also enables grouping class methods
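To make the emission step concrete, here is a hedged sketch that turns weighted edges into Mermaid lines; the 20-character id truncation mirrors what the graphs above show, but the function itself is illustrative rather than the repo's code:

```python
def edges_to_mermaid(edges):
    # edges are (caller, callee, count) tuples from the compression step
    lines = ["graph LR;"]
    for caller, callee, count in edges:
        label = f"|{count}|" if count > 1 else ""
        # ids are truncated to 20 chars; the [bracketed] label keeps the full name
        lines.append(f"{caller[:20]}[{caller}] -->{label} {callee[:20]}[{callee}];")
    return "\n".join(lines)

print(edges_to_mermaid([("mul", "np.array", 2), ("mul", "np.matmul", 1)]))
```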
Uses
Code Analysis
As we saw in the examples, the code graph can tell us some basic things at a glance:
- what are the top level functions - look at the nodes farthest to the left
- what external libraries do we have the most dependencies on - look at call counts into the module subgraphs and terminal nodes with many ancestors
- do we have dead code - is there a component of the graph that contains a top level function we no longer use?
Documentation
For smaller modules, this is also a form of documentation. You can get a pretty good idea of what a function is doing by looking at its name and what it calls. For larger modules this still applies, but a module with many definitions can become cluttered, and you may want to break it into pieces.
Extensions
Code Similarity
Given two code graphs, how can we measure similarity?
- use basic graph similarity measures from a graph object constructed from the edge data
- use the calls into external modules and see how much they overlap
- use the markdown graph as a compressed version of the two modules and run text similarity methods, which can be difficult to run on the uncompressed original source
By identifying when we have similar code structures, we can begin to develop better abstractions (as needed). Right now, the project only generates the graph data, but similarity analysis would be a natural extension of that data.
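As an illustration of the external-call overlap idea, a Jaccard score over called names could be computed from the edge data; this is a sketch, not part of the repo:

```python
def external_call_overlap(edges_a, edges_b):
    # Jaccard similarity over the sets of functions each module calls
    calls_a = {callee for _, callee in edges_a}
    calls_b = {callee for _, callee in edges_b}
    if not (calls_a or calls_b):
        return 1.0
    return len(calls_a & calls_b) / len(calls_a | calls_b)

a = [("mul", "np.array"), ("mul", "np.matmul")]
b = [("dot", "np.matmul")]
print(external_call_overlap(a, b))  # 0.5
```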
Code Generation
Using the basic building blocks of an answer to a small problem, can we randomly recreate it?
This is more of a fringe use case, but you could use the inter-function dependency data to shrink the space of grammatical code expressions in order to randomly generate code. Not quite a copilot by any stretch, but could generate some interesting results.
Lessons Learned
When to Walk, When to Run
Initially, I was parsing code using the full list of nodes generated by ast.walk(). Unfortunately, this makes it difficult to connect calls to the function definitions they may reside in. In the walked list, a function definition node can be followed by call nodes, but it takes extra work to determine whether those calls live inside the function definition or merely appear later in the walked tree. You could get around this with line number comparisons (I leave some deprecated functions in the repo for doing this), but I found it simpler to recursively use the node structure generated by ast.
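A small demonstration of the problem, using only the standard library: in the flat walk, nothing on a call node itself says which definition it belongs to.

```python
import ast

SRC = """
def f():
    g()
h()
"""

# both calls appear in one flat list; the nodes alone don't reveal
# that g() lives inside f while h() sits at module scope
for node in ast.walk(ast.parse(SRC)):
    if isinstance(node, ast.Call):
        print(node.func.id, node.lineno)
```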
On the other hand, you sometimes do want to sweep through all the child nodes. When you manually walk the tree, you can end up with heavily nested structures to traverse. For instance, f() + (2 * (g() + h())) involves going through the expression and parsing binary operations and nested expressions just to get the data you care about (the three functions that were called).
When you crawl the node structure, selectively using ast.walk on some of the child nodes of the top level module node can avoid this. Get the highest level nodes – like classes and function definitions – and then allow ast to quickly walk all the children of those nodes so you can get all the data you care about and also tie it to those higher level nodes.
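A sketch of that hybrid crawl: pick off the module's direct children yourself, then let ast.walk sweep each definition's subtree so every call found is tied to its parent definition.

```python
import ast

SRC = """
def f():
    g()
h()
"""

tree = ast.parse(SRC)
for top in tree.body:  # only the module's direct children
    if isinstance(top, ast.FunctionDef):
        # every call found in this subtree belongs to `top`
        calls = [n.func.id for n in ast.walk(top) if isinstance(n, ast.Call)]
        print(top.name, "calls", calls)  # f calls ['g']
```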
Keywords and Meta-programming
In my graph description generator, I'm careful to rename words like "map" and "find" when they are used outside of node labels. These appear to be keywords in the process that generates the graph image from the markdown, so including them as regular node names causes the preview to error out in VSCode. There was also a special case for the word "end" appearing as part of a node name, because it conflicts with using "end" to delineate subgraphs; the fix there is to capitalize part of "end". This is why it is important to perform input sanitization before compiling the final product.
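A hedged sketch of that renaming; the reserved-word set here is an assumption, though the py. prefix matches the py.find node id visible in the second example's graph:

```python
def safe_node_id(name: str) -> str:
    # rename names that collide with Mermaid keywords when used as node ids
    if name == "end":
        return "End"  # "end" would otherwise terminate a subgraph block
    if name in {"map", "find"}:
        return "py." + name  # keep the readable name in the [label] instead
    return name

print(safe_node_id("find"))  # py.find
print(safe_node_id("end"))   # End
```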
This is a general concern to keep in mind: when you're creating descriptions of code, you want to be careful about escaping problematic terms so that whatever system you are using does not confuse a command with a description. I found similar issues when printing the Mermaid graphs in a notebook, sometimes causing the entire notebook to hit invalid states. It's hard to give better guidance than "be on the lookout."
Low-level Calls
In common expressions like some_list.append(item) you are performing a function call. While it may be useful to see this in the final graph, I found that it cluttered the visualization. To get around this, I kept such calls in the parsed data but used explicit lists of common function names to exclude when creating the graph markdown. Along with some rules for how to handle parsing a call's module, this removed a lot of noise.
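A minimal sketch of that filter; the set contents here are examples, not the repo's actual exclusion list:

```python
COMMON_CALLS = {"print", "append", "read", "index"}

def keep_edge(caller, callee):
    # drop edges whose final name segment is a common low-level call
    return callee.split(".")[-1] not in COMMON_CALLS

print(keep_edge("check", "np.zeros"))       # True
print(keep_edge("check", "things.append"))  # False
```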
This is an opportunity for improvement, though, because I'm using simple rules to decide when to skip over calls. Ideally, something that determines the type of the object a method is attached to (are we operating on a dictionary or a list?) would improve this.
Classes
Related to the difficulty of detecting type information: when class definitions are involved, things begin to get tricky. In fact, I thought I was done with the project before I saw how large a gap effectively processing classes was.
Getting the function definition information out of class methods is straightforward, because things are nicely grouped in an ast node for the class. Less obvious is how to determine when you are making a call to an instantiated object of a class.
My solution was a hack: when you have a call, check whether its name overlaps with the name of any class in the module. If so, it's likely creating an object, so jump up the walked syntax tree to the assignment and keep track of the name used for the assignment target (the x in x = 1). Then, in that scope, you know any calls involving that object name are calls to the class's methods, and you can track them appropriately.
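A minimal sketch of the hack using only the standard library; the repo implements this via get_instantiated_object_name and update_call_data_for_object_info, whose details may differ:

```python
import ast

def find_object_names(tree, class_names):
    # when an assignment's right-hand side calls a known class name,
    # remember the target so later obj.method() calls can be attributed
    names = {}
    for node in ast.walk(tree):
        if (isinstance(node, ast.Assign)
                and isinstance(node.value, ast.Call)
                and isinstance(node.value.func, ast.Name)
                and node.value.func.id in class_names):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    names[target.id] = node.value.func.id
    return names

tree = ast.parse("c = Catdog(3)\nt = c.run()")
print(find_object_names(tree, {"Catdog"}))  # {'c': 'Catdog'}
```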
To do this “the right way” you would have to almost run your own version of the interpreter, as mentioned in the low-level calls lesson. This would definitely be an interesting problem to work on, but I’m satisfied omitting the ability to parse more elaborate Python code in this visualizer.
Have Fun
Feel free to experiment with the code and let me know if you find other use cases or if you’d like to contribute.