Listing all files in a Git repository with pygit2
One of the more interesting programming tasks I've had recently is listing all the files in a Git repository programmatically. One way to do this would be to check out the repository and then walk the filesystem, but I'd much rather do it with the pygit2 module in Python. It's simple enough to access a repository that you already have on disk by calling pygit2.Repository(repo_path). If repo_path is the path to a valid Git repository, this opens a handle you can use to inspect it.
My next task was, logically, to walk all the files in the repository. In my mind this should be simple. But in Git, any concept of a file in the repository has to be associated with some revision. The repository object has a .walk function, but that is used to walk the commits starting at a certain revision. To get access to a specific revision you can use the function .revparse_single and pass it a string argument. If you pass it the name of a branch like master you get the commit object at the tip of that branch. That object has a .tree attribute. I thought initially this would be something that could iterate over all the files currently in that branch, but that isn't the case.
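To make that concrete, here is a minimal sketch that lists only what the root tree of a revision contains. It assumes pygit2 1.0 or later, where tree entries expose .name and .type_str, and it assumes the revision resolves to a commit. The function names top_level_entries and print_top_level are made up for illustration.

```python
def top_level_entries(repo, revision):
    """Return (name, kind) pairs for the root tree of `revision`.

    `repo` is expected to behave like a pygit2.Repository. `kind` is the
    entry's type_str, for example "tree" for a directory or "blob" for a file.
    """
    commit = repo.revparse_single(revision)
    return [(entry.name, entry.type_str) for entry in commit.tree]


def print_top_level(repo_path, revision="master"):
    # Imported here so the helper above has no hard dependency on pygit2.
    import pygit2

    repo = pygit2.Repository(repo_path)
    for name, kind in top_level_entries(repo, revision):
        print(kind, name)
```

Calling print_top_level on a checkout prints one line per top-level entry, and nothing from inside any subdirectory.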
As it turns out, what we commonly call a directory in a Git repo is actually just another tree object. You can't actually check a directory in to Git as far as I am aware, so this makes a certain degree of sense. So by iterating over the .tree of a branch we encounter both files and more "trees". These in turn need to be iterated over. The obvious solution is to use recursion, but I was able to come up with an iterative algorithm that I think is simpler.
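A minimal sketch of such an iterative walk, not the original script, might look like the following. It assumes pygit2 1.0 or later, where tree entries expose .name, .id and .type_str, and it treats anything that is not a tree as a file. The names list_files and print_all_files are made up for illustration.

```python
import posixpath


def list_files(repo, revision):
    """Yield the path of every file reachable from `revision`.

    `repo` is expected to behave like a pygit2.Repository.
    """
    root = repo.revparse_single(revision).tree
    # Each pending item is (path prefix, tree still to be processed).
    pending = [("", root)]
    while pending:
        prefix, tree = pending.pop()
        for entry in tree:
            path = posixpath.join(prefix, entry.name)
            if entry.type_str == "tree":
                # A directory: resolve the subtree by its ID and queue it.
                pending.append((path, repo.get(entry.id)))
            else:
                yield path


def print_all_files(repo_path, revision):
    # Imported here so list_files has no hard dependency on pygit2.
    import pygit2

    for path in list_files(pygit2.Repository(repo_path), revision):
        print(path)
```

The explicit list of pending trees plays the role the call stack would play in a recursive version.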
This example works by starting from the object obtained from .revparse_single() for the branch. That object references a single tree. Each time a subtree is encountered, the object it refers to is resolved by calling .get() with the ID. This is added to the list of trees to be processed. You can run this example like this:
$ python3 list_all_files_pygit2.py /home/ericu/builds/Meridian59/ master
.gitignore
LICENSE
MeridianPalette.BMP
README
blakston.pal
common.mak
common.mak.linux
makefile
rules.mak
rules.mak.linux
This prints out every file in the repository on a specific branch.
Sorted output
The above example works, but I wanted the output to be sorted. Specifically, I want the order you see on any website like GitHub or Bitbucket, with all the directories first and all file entries last. Everything should be sorted alphabetically, ignoring the case of the letters.
The algorithm I used above added directories to be processed as they were encountered. What I did instead was save all of them into a list, then sort the list. The list is then copied onto the stack of trees remaining to be processed. Because the stack is popped from the end, this has the effect of reversing the sort order.
The iteration I already had would produce all the files in a given directory one after the other. I changed that to keep all the files in a list. The list is sorted in reverse order, then each result is yielded to the caller.
This has the effect of producing the desired sorting in reverse order. So I just reverse the entire list at this point and I have the output I want.
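The steps above can be sketched roughly as follows. This is an illustration rather than the original script. It assumes pygit2 1.0 or later, where tree entries expose .name, .id and .type_str, and the names _reverse_sorted_files and list_files_sorted are made up here.

```python
import posixpath


def _reverse_sorted_files(repo, revision):
    """Yield file paths in exactly the reverse of the desired order."""
    pending = [("", repo.revparse_single(revision).tree)]
    while pending:
        prefix, tree = pending.pop()
        dirs, files = [], []
        for entry in tree:
            path = posixpath.join(prefix, entry.name)
            if entry.type_str == "tree":
                dirs.append((path, entry.id))
            else:
                files.append(path)
        # Files of this directory come out in reverse alphabetical order.
        yield from sorted(files, key=str.lower, reverse=True)
        # Directories are pushed in ascending order, so the stack pops
        # them in descending order.
        for path, oid in sorted(dirs, key=lambda item: item[0].lower()):
            pending.append((path, repo.get(oid)))


def list_files_sorted(repo, revision):
    """Return all file paths, directories first, ignoring case."""
    ordered = list(_reverse_sorted_files(repo, revision))
    ordered.reverse()
    return ordered
```

Passing a pygit2.Repository and a branch name to list_files_sorted returns the full listing as a list, since the final reversal needs every result before the first one can be produced.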
License
All Python source code on this specific web page is available under the following license:
Copyright (c) 2024 Eric Urban

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.