Listing all files in a Git repository with pygit2

One of the more interesting programming tasks I've had recently is listing all the files in a Git repository programmatically. One way to do this would be to check out the repository and then walk the filesystem. But I'd much rather do this with the pygit2 module in Python. It's simple enough to access a repository that you already have on disk by calling pygit2.Repository(repo_path). If repo_path is the path to a valid Git repository, this returns a handle you can use to inspect it.
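
For comparison, the filesystem approach mentioned above could look roughly like the sketch below. The function name list_checked_out_files is hypothetical, and the sketch assumes the repository has a working tree checked out at repo_path. It skips any .git directory so that only working-tree files are reported.

```python
import os

def list_checked_out_files(repo_path):
    """Yield paths of files in a checked-out working tree, relative to repo_path."""
    for dirpath, dirnames, filenames in os.walk(repo_path):
        # prune the .git directory in place so os.walk never descends into it
        if ".git" in dirnames:
            dirnames.remove(".git")
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            yield os.path.relpath(full_path, repo_path)
```

This reports whatever happens to be on disk, including untracked and ignored files, and only for the revision that is currently checked out. That is one reason to read the listing from Git's own objects instead.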

My next task was, logically, to walk all the files in the repository. In my mind this should be simple. But in Git, any concept of a file in the repository has to be associated with some revision. The repository object has a .walk function, but that is used to walk the commits starting at a certain revision. To get access to any specific revision you can use the function .revparse_single and pass it a string argument. If you pass it the name of a branch like master, you get the commit object at the tip of that branch. A commit object has a .tree attribute. I initially thought this would be something that could iterate over all the files currently in that branch, but that isn't the case.

As it turns out, what we commonly call a directory in a Git repo is actually just another tree object. As far as I am aware, you can't check a directory in to Git on its own, so this makes a certain degree of sense. So by iterating over the .tree of a branch we encounter both files and more "trees". These in turn need to be iterated over. The obvious solution to this is to use recursion, but I was able to come up with an iterative algorithm that I think is simpler.
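
For comparison, the recursive approach might look like the following sketch. To keep it runnable without a repository, it uses nested dicts as a stand-in for Git trees: a dict value plays the role of a subtree and None plays the role of a file. These stand-ins, the example layout, and the name walk_recursive are hypothetical and are not part of pygit2.

```python
import os

def walk_recursive(tree, path=()):
    """Yield the path of every file below tree, descending depth-first."""
    for name, child in tree.items():
        if isinstance(child, dict):
            # a nested dict stands in for a Git subtree, so recurse into it
            yield from walk_recursive(child, path + (name,))
        else:
            yield os.path.join(*path, name)

# hypothetical repository layout, not taken from a real repository
example_tree = {
    "README": None,
    "src": {"main.c": None, "util": {"log.c": None}},
}

for file_path in walk_recursive(example_tree):
    print(file_path)
```

Each level of directory nesting costs one Python stack frame here, whereas keeping an explicit list of pending trees avoids that.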

 list_all_files_pygit2.py 893 Bytes

import pygit2 # using version 1.15.1 in this example
import sys
import os

def walk_repo_files(repo, branch):
    tree = repo.revparse_single(branch).tree
    trees_and_paths = [(tree, [])]
    # keep going until there is no more data
    while len(trees_and_paths) != 0:
        tree, path = trees_and_paths.pop() # take the last entry
        for entry in tree:
            if entry.filemode == pygit2.GIT_FILEMODE_TREE:
                next_tree = repo.get(entry.id)
                next_path = list(path)
                next_path.append(entry.name)
                trees_and_paths.append((next_tree, next_path,))
            else:
                yield os.path.join(*path, entry.name)

repo_path = sys.argv[1]
branch_name = sys.argv[2]

repo = pygit2.Repository(repo_path)

for entry in walk_repo_files(repo, branch_name):
    sys.stdout.write(entry)
    sys.stdout.write("\n")

This example works by starting from the object that .revparse_single() returns for the branch. That object references a single tree. Each time a subtree is encountered, its ID is resolved to a tree object by calling .get(). This is added to the list of trees to be processed. You can run this example like this

$ python3 list_all_files_pygit2.py /home/ericu/builds/Meridian59/ master
.gitignore
LICENSE
MeridianPalette.BMP
README
blakston.pal
common.mak
common.mak.linux
makefile
rules.mak
rules.mak.linux

This prints out every file in the repository on a specific branch.

Sorted output

The above example works, but I wanted the output to be sorted. Specifically, I want the order you see on any website like GitHub or Bitbucket, with all the directories first and all file entries last. Everything should be sorted alphabetically, but ignoring the case of the letters.
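
Within a single directory, that ordering can be expressed as one sort key. This sketch works on plain (name, is_directory) tuples rather than pygit2 objects. The tuples, the sample names, and the function name listing_sort_key are hypothetical and only meant to illustrate the ordering.

```python
def listing_sort_key(item):
    name, is_directory = item
    # False sorts before True, so "not is_directory" puts directories first;
    # lower() makes the alphabetical comparison ignore case
    return (not is_directory, name.lower())

# hypothetical directory contents
entries = [
    ("makefile", False),
    ("src", True),
    ("README", False),
    ("Docs", True),
    ("LICENSE", False),
]

sorted_entries = sorted(entries, key=listing_sort_key)
for name, is_directory in sorted_entries:
    print(name + ("/" if is_directory else ""))
```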

The algorithm I use above added directories to be processed as they were encountered. What I did instead was save all of these into a list, then sort the list. The list is then appended to the stack of trees remaining to be processed. Because trees are popped from the end of the stack, this has the effect of reversing the sort order.

The iteration I already had would produce all the files in a given directory one after the other. I changed that to keep all the files in a list. The list is sorted in reverse order then each result is yielded to the caller.

This has the effect of producing the desired sorting in reverse order. So I just reverse the entire list at this point and I have the output I want.
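
To see why the final reversal gives the right result, here is a small sketch of the same idea. It uses nested dicts as a stand-in for Git trees, where a dict is a subtree and None is a file. The stand-ins, the example layout, and the name walk_reverse_sorted are hypothetical.

```python
import os

def walk_reverse_sorted(root):
    """Yield file paths in the reverse of the desired order."""
    pending = [(root, [])]  # stack of (tree, path) pairs still to process
    while pending:
        tree, path = pending.pop()
        files = [name for name, child in tree.items() if child is None]
        subtrees = [name for name, child in tree.items() if child is not None]
        # files of this directory come out in reverse alphabetical order
        for name in sorted(files, key=str.lower, reverse=True):
            yield os.path.join(*path, name)
        # pushed in ascending order, so pop() visits them in descending order
        for name in sorted(subtrees, key=str.lower):
            pending.append((tree[name], path + [name]))

# hypothetical repository layout
example_tree = {
    "README": None,
    "makefile": None,
    "doc": {"Intro.txt": None, "api.txt": None},
    "src": {"main.c": None},
}

ordered = list(walk_reverse_sorted(example_tree))
ordered.reverse()  # directories now come first, each group in ascending order
for file_path in ordered:
    print(file_path)
```

Before the reversal the top-level files come first and the directories follow in descending order, so reversing the whole list puts the directories first in ascending order and the top-level files last.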

 list_all_files_sorted_pygit2.py 1.8 kB

import pygit2 # using version 1.15.1 in this example
import sys
import os

def walk_repo_files(repo, branch):
    tree = repo.revparse_single(branch).tree
    trees_and_paths = [(tree, [])]
    # keep going until there is no more data
    while len(trees_and_paths) != 0:
        tree, path = trees_and_paths.pop() # take the last entry

        new_trees = []
        for entry in tree:
            if entry.filemode == pygit2.GIT_FILEMODE_TREE:
                next_tree = repo.get(entry.id)
                next_path = list(path)
                next_path.append(entry.name)
                new_trees.append((next_tree, next_path, entry.name))
            else:
                yield path, entry

        # sort by the last element in the tuples, the name
        new_trees.sort(key = lambda x: x[2].lower())
        trees_and_paths.extend( (a,b) for a, b, _ in new_trees)


def walk_repo_files_reverse_sorted(repo, branch):
    walker = walk_repo_files(repo, branch)
    try:
        prev_path, entry = next(walker)
    except StopIteration:
        return # repo is empty
    accum = [entry]
    for path, entry in walker:
        if prev_path != path:
            # sort elements in reverse order by name
            accum.sort(key = lambda x: x.name.lower(), reverse=True)
            for i in accum:
                yield prev_path, i
            prev_path = path
            accum.clear()

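        accum.append(entry)

    # after the loop, flush the files of the last directory visited;
    # without this the final group would never be yielded
    accum.sort(key = lambda x: x.name.lower(), reverse=True)
    for i in accum:
        yield prev_path, i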


repo_path = sys.argv[1]
branch_name = sys.argv[2]

repo = pygit2.Repository(repo_path)

all_files = list(walk_repo_files_reverse_sorted(repo, branch_name))
all_files.reverse()
for path, entry in all_files:
    sys.stdout.write(os.path.join(*path, entry.name))
    sys.stdout.write("\n")

License

All Python source code on this specific web page is available under the following license

Copyright (c) 2024 Eric Urban

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Copyright Eric Urban 2024, or the respective entity where indicated