Fixing a Corrupt Git Index
I’ve been building a git implementation in Ruby called Jit as part of a Recurse Center study group that’s reading Building Git by James Coglan.
I finished building a feature and was ready to commit. I staged the modified files via jit add
and ran jit status
when I encountered this output:
% jit status
fatal: Checksum does not match value stored on disk
Uh oh, had I corrupted my index file somehow?
Inspecting the Index on disk
The checksum refers to the final 20 bytes stored in .git/index
which is a sha1sum
of everything in the index that precedes those final 20 bytes.
Okay, let’s compute the checksum via the command line and inspect the last 20 bytes of my .git/index
file:
% wc -c .git/index
3526 .git/index
% cat .git/index | head -c 3506 | sha1sum
fdcf9b375d7f21c2a90aa68440925d5976ae2e54 -
% cat .git/index | tail -c 20 | hexdump -C
00000000 fd cf 9b 37 5d 7f 21 c2 a9 0a a6 84 40 92 5d 59 |...7].!.....@.]Y|
00000010 76 ae 2e 54 |v..T|
00000014
What the heck? The hashes match!
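The same comparison can be sketched in Ruby. This is only an illustration: trailing_checksum_valid? is a hypothetical helper name that is not part of Jit, and it assumes the whole index fits comfortably in memory.

```ruby
require 'digest'

CHECKSUM_SIZE = 20 # a SHA-1 digest is 20 raw bytes

# Returns true when the last 20 bytes of `bytes` are the SHA-1 of
# everything that precedes them. (Hypothetical helper, not from Jit.)
def trailing_checksum_valid?(bytes)
  bytes = bytes.b # compare as raw bytes regardless of source encoding
  return false if bytes.bytesize < CHECKSUM_SIZE

  body   = bytes.byteslice(0, bytes.bytesize - CHECKSUM_SIZE)
  stored = bytes.byteslice(-CHECKSUM_SIZE, CHECKSUM_SIZE)
  Digest::SHA1.digest(body) == stored
end
```

With a real repository this would be called as trailing_checksum_valid?(File.binread('.git/index')).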
Let’s throw a breakpoint in my code and inspect what’s going on:
% jit status
From: jit/lib/index.rb:87 Index#load:
    80: def load
    81:   clear
    82:   file = open_index_file
    83:   if file
    84:     reader = Checksum.new(file)
    85:     count = read_header(reader)
    86:     read_entries(reader, count)
 => 87:     binding.pry
    88:     reader.verify_checksum
    89:   end
    90: ensure
    91:   file&.close
    92: end
[1] pry(#<Index>)> reader.digest.digest.bytes.map { |byte| byte.to_s(16) }.join
=> "ef7552992140e6169ca2cf2ba9db2bc47115d96"
Well, that’s definitely a different checksum. So what’s happening here?
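One caveat when eyeballing that string against the hexdump: Integer#to_s(16) does not zero-pad, so any byte below 0x10 contributes one character instead of two. That is why the string above is 39 characters long rather than 40. The conclusion doesn’t change (the computed digest starts with ef while the stored checksum starts with fd), but a padded form is easier to compare. A small sketch of the difference, using made-up bytes rather than the real digest:

```ruby
# Three arbitrary bytes, two of which are below 0x10.
raw = [0x0a, 0xff, 0x00].pack('C*')

# The approach from the pry session: leading zeros are dropped.
unpadded = raw.bytes.map { |byte| byte.to_s(16) }.join

# Two alternatives that always emit two hex digits per byte.
padded_unpack = raw.unpack1('H*')
padded_format = raw.bytes.map { |byte| format('%02x', byte) }.join
```

For a Digest object, calling hexdigest instead of digest gives the padded form directly.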
A Higher Level Perspective
It might help to explain what the status command is doing.
The first thing the status
command does is read .git/index
into memory before using that in-memory representation to determine if any files in your workspace have changed and if anything in your index differs from your previous commit. The checksum error I’m getting happens in that very first step, reading the index into memory.
Examining the Code
# lib/index.rb
def load
  clear
  file = open_index_file
  if file
    reader = Checksum.new(file)
    count = read_header(reader)
    read_entries(reader, count)
    reader.verify_checksum
  end
ensure
  file&.close
end
The first 12 bytes of .git/index
constitute a header that specifies (amongst other things) how many entries are stored in the index. That’s the return value of read_header
which we store in count
.
Looking at the read_entries
method shows this:
# lib/index.rb
def read_entries(reader, count)
  count.times do
    entry = reader.read(ENTRY_MIN_SIZE)
    until entry.byteslice(-1) == "\0"
      entry.concat(reader.read(ENTRY_BLOCK))
    end
    store_entry(Entry.parse(entry))
  end
end
Each index entry is null terminated, so read_entries
reads from the index until it finds a null byte. It then stores that entry in memory, repeating that process count
times.
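The null terminator doubles as padding. In version 2 of the index format, each entry has 62 bytes of fixed-width fields before the path, and the entry is padded with 1 to 8 null bytes so that its total length is a multiple of 8. Here is a minimal sketch of that arithmetic; the constant and method names are mine, chosen for illustration, and are not taken from Jit:

```ruby
ENTRY_FIXED_BYTES = 62 # ten 32-bit stat fields + 20-byte OID + 16-bit flags
ENTRY_ALIGNMENT   = 8  # entries are padded to a multiple of 8 bytes

# Total on-disk size of a version 2 index entry for the given path,
# including the 1 to 8 trailing null bytes. (Hypothetical helper.)
def padded_entry_size(path)
  unpadded = ENTRY_FIXED_BYTES + path.bytesize + 1 # at least one null byte
  (unpadded + ENTRY_ALIGNMENT - 1) / ENTRY_ALIGNMENT * ENTRY_ALIGNMENT
end
```

For an 18-byte path such as test/index_test.rb this gives 88 bytes, which means 8 null bytes follow the path.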
What’s that reader object doing?
# lib/index/checksum.rb
def initialize(file)
  @file = file
  @digest = Digest::SHA1.new
end

def read(size)
  data = @file.read(size)
  raise EndOfFile, 'Unexpected end-of-file while reading index' unless data.bytesize == size
  @digest.update(data)
  data
end

def verify_checksum
  sum = @file.read(CHECKSUM_SIZE)
  unless sum == @digest.digest
    raise Invalid, 'Checksum does not match value stored on disk'
  end
end
The reader
object reads a specified number of bytes from a file, updates the sha1sum
of everything it’s read, and returns the read chunk of data. verify_checksum
reads 20 bytes from the file and compares them to the current digest value, raising an error if they don’t match.
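The property this design relies on is that feeding a digest in chunks yields the same result as hashing everything at once. A small sketch with arbitrary stand-in data (not a real index):

```ruby
require 'digest'
require 'stringio'

# 104 bytes of stand-in data, read back in 8-byte chunks.
data = 'DIRC' + ('x' * 100)
io = StringIO.new(data)

streaming = Digest::SHA1.new
while (chunk = io.read(8)) # read returns nil once the data is exhausted
  streaming.update(chunk)
end

one_shot = Digest::SHA1.digest(data)
chunked  = streaming.digest
```

Here chunked and one_shot hold the same 20 bytes, which is why the reader can hash entries as it parses them instead of buffering the whole file.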
A Hypothesis
After looking at my code, it seems that we’re blindly trusting the value of count
that’s returned from reading the index header. If that count
value is wrong then verify_checksum
will read the wrong 20 bytes and everything will break. It’s possible that we’re reading count
incorrectly, or the value in the header is incorrect, and both are pretty easy to check.
Hexdump the Header
First let’s see what my code says the value of count
is:
From: jit/lib/index.rb:87 Index#load:
    80: def load
    81:   clear
    82:   file = open_index_file
    83:   if file
    84:     reader = Checksum.new(file)
    85:     count = read_header(reader)
    86:     read_entries(reader, count)
 => 87:     binding.pry
    88:     reader.verify_checksum
    89:   end
    90: ensure
    91:   file&.close
    92: end
[1] pry(#<Index>)> count
=> 38
According to the documentation, the index header contains a 4-byte signature “DIRC”, followed by a 4-byte version number, and a 32-bit number of index entries.
% cat .git/index | head -c 12 | hexdump -C
00000000 44 49 52 43 00 00 00 02 00 00 00 26 |DIRC.......&|
0000000c
0x26
is 38 in decimal, so it seems my code and the index header agree on the number of entries to expect.
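The same check can be done without reading hex by hand. This is a sketch rather than Jit’s actual read_header; parse_index_header is a hypothetical name, and it relies only on the three header fields described above:

```ruby
INDEX_HEADER_SIZE   = 12
INDEX_HEADER_FORMAT = 'a4N2' # 4 raw bytes, then two big-endian 32-bit integers

# Splits the 12-byte index header into its three fields.
# (Hypothetical helper, not Jit's read_header.)
def parse_index_header(bytes)
  raise ArgumentError, 'header too short' if bytes.bytesize < INDEX_HEADER_SIZE

  signature, version, count = bytes.unpack(INDEX_HEADER_FORMAT)
  raise ArgumentError, "unexpected signature #{signature.inspect}" unless signature == 'DIRC'

  { signature: signature, version: version, count: count }
end

# The 12 bytes from the hexdump above.
header = parse_index_header("DIRC\x00\x00\x00\x02\x00\x00\x00\x26".b)
```

With those bytes, header[:version] is 2 and header[:count] is 38.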
Manually Count the Index Entries
Each index entry contains the pathname of the entry in uncompressed ASCII format, making it relatively easy to count the number of entries by hand via hexdump.
I hexdumped the whole index and started counting the entries. After the 38th entry I found something weird.
Here are the last 400 bytes of my .git/index
file.
index_test.rb
is the 38th entry, so I should expect to find the checksum immediately after it:
% cat .git/index | tail -c 400 | hexdump -C
00000000 00 11 04 5b 84 1e 00 00 81 a4 00 00 01 f5 00 00 |...[............|
00000010 00 14 00 00 05 2e b1 45 46 a8 0f 64 93 b6 20 10 |.......EF..d.. .|
00000020 aa 3b 6e 41 bf 8e c2 8e a4 02 00 12 74 65 73 74 |.;nA........test|
00000030 2f 69 6e 64 65 78 5f 74 65 73 74 2e 72 62 00 00 |/index_test.rb..|
00000040 00 00 00 00 00 00 54 52 45 45 00 00 01 2e 00 33 |......TREE.....3|
00000050 38 20 34 0a e6 a2 e7 72 4e 4d b7 97 cd e9 6a e2 |8 4....rNM....j.|
00000060 b9 bc 4d 7b ae 98 3e 6c 62 69 6e 00 32 20 30 0a |..M{..>lbin.2 0.|
00000070 d2 f1 e0 40 39 09 27 01 a4 eb 00 a8 fb 64 b6 4f |...@9.'......d.O|
00000080 47 63 9e b1 6c 69 62 00 32 34 20 34 0a c8 1d 17 |Gc..lib.24 4....|
00000090 0e ff 81 4c a5 69 4c e6 8f 53 bb 79 e8 01 db 62 |...L.iL..S.y...b|
000000a0 f5 64 69 66 66 00 31 20 30 0a f5 69 d3 e8 5d 57 |.diff.1 0..i..]W|
000000b0 2a d4 af e6 aa 8a 0a 4f b2 cc 6e 6b 61 71 69 6e |*......O..nkaqin|
000000c0 64 65 78 00 32 20 30 0a 15 0a b2 e4 86 42 7e de |dex.2 0......B~.|
000000d0 90 8e b4 0c 88 d6 62 d8 5f a0 52 c2 63 6f 6d 6d |......b._.R.comm|
000000e0 61 6e 64 00 35 20 30 0a 09 56 b3 03 77 e5 ec 93 |and.5 0..V..w...|
000000f0 1e ad 3a f5 a3 57 2a ee fe a0 d8 c1 64 61 74 61 |..:..W*.....data|
00000100 62 61 73 65 00 35 20 30 0a 84 0d cd 7d 0b 7c 42 |base.5 0....}.|B|
00000110 7a 7a 57 aa c3 d2 1b 6c 8c 32 e3 cc 72 74 65 73 |zzW....l.2..rtes|
00000120 74 00 34 20 31 0a 2a 2d 76 65 d6 35 ee cb a1 df |t.4 1.*-ve.5....|
00000130 71 07 7d 7f 29 aa b3 5f e9 13 63 6f 6d 6d 61 6e |q.}.).._..comman|
00000140 64 00 32 20 30 0a 2d dd 06 b9 ec ce 35 fd a6 62 |d.2 0.-.....5..b|
00000150 ca 7b c9 d9 d9 b9 22 99 cc 06 2e 72 75 62 79 2d |.{...."....ruby-|
00000160 6c 73 70 00 34 20 30 0a 98 d0 54 74 23 af 4a 47 |lsp.4 0...Tt#.JG|
00000170 d2 c2 0d 46 7f 9e 24 8a 0d 78 f1 67 fd cf 9b 37 |...F..$..x.g...7|
00000180 5d 7f 21 c2 a9 0a a6 84 40 92 5d 59 76 ae 2e 54 |].!.....@.]Yv..T|
00000190
What the heck? There’s the word TREE
followed by a bunch of binary data. That’s some extra data, but it’s definitely not the format of a valid index entry.
It almost looks like I’ve written the content of a TREE
database object directly into my index. But I’ve been making commits and staging files in this project for a while. If I was writing arbitrary data into the index I surely would have encountered an issue sooner.
Also, looking closer at the hexdump we see that mixed amongst the binary data are the strings bin
, lib
, diff
, index
, command
, database
, test
, command
, and .ruby-lsp
. These correspond to all the subdirectories in my workspace, but interestingly there’s no reference to any actual files in this rogue TREE
object. The trees stored in the database contain references to both files and directories, so there’s no way that I’ve written a database tree’s content into my index.
% find . -type d -not -path "*.git*" -not -path .
./test
./test/command
./bin
./.ruby-lsp
./lib
./lib/database
./lib/diff
./lib/index
./lib/command
Documentation to the Rescue
I had pulled up the index format documentation for git while debugging. There are a lot of references to trees in that page, but one in particular stands out:
Cache tree
Since the index does not record entries for directories, the cache entries cannot describe tree objects that already exist in the object database for regions of the index that are unchanged from an existing commit. The cache tree extension stores a recursive tree structure that describes the trees that already exist and completely match sections of the cache entries. This speeds up tree object generation from the index for a new commit by only computing the trees that are “new” to that commit. It also assists when comparing the index to another tree, such as
HEAD^{tree}
, since sections of the index can be skipped when a tree comparison demonstrates equality.
The recursive tree structure uses nodes that store a number of cache entries, a list of subnodes, and an object ID (OID). The OID references the existing tree for that node, if it is known to exist. The subnodes correspond to subdirectories that themselves have cache tree nodes. The number of cache entries corresponds to the number of cache entries in the index that describe paths within that tree’s directory.
The extension tracks the full directory structure in the cache tree extension, but this is generally smaller than the full cache entry list. When a path is updated in index, Git invalidates all nodes of the recursive cache tree corresponding to the parent directories of that path. We store these tree nodes as being “invalid” by using “-1” as the number of cache entries. Invalid nodes still store a span of index entries, allowing Git to focus its efforts when reconstructing a full cache tree.
The signature for this extension is { ‘T’, ‘R’, ‘E’, ‘E’ }.
A series of entries fill the entire extension; each of which consists of:
…
That must be it! In fact, if I run git status
it seems git
has no issue reading my index file. That would make sense seeing as this is an “index extension” and therefore unsupported by my own implementation.
How did this happen?
Digging around my .git
folder I found the following in .git/logs/HEAD
09bda1a2eabf18826997349410032f3d61229345 5704864b6c45d0991a1671fb196a51aa581b4398 Garrett-Bodley <garrett.bodley@gmail.com> 1710967835 -0400 commit: Stub out the Myer's Diff algorithm.
That’s my most recent commit! But my code doesn’t have any logic pertaining to .git/logs/HEAD
.
I must have been on autopilot and used git commit
instead of my own jit commit
command. Git then mutated my index and added on the “Cache Tree” extension when making the commit. My own implementation has no concept of a “Cache Tree” so that addition caused everything to break.
Fixing the problem
Thankfully the solution to all this is straightforward: delete the index file.
Running jit add .
will generate a new index file that is tracking the entire workspace. After deleting and recreating the index, jit status
works entirely as expected.
Post Mortem
I learned that while git
works in jit
repositories, the reverse is not true. Git is a complicated tool, and the pared down version that I’ve implemented doesn’t cover all the edge cases that exist in the main tool.
It’s rather unclear to me when a Cache Tree gets generated. Clearly one is made when making a commit, but not when calling git status
. I did a quick Google search and wasn’t able to find any posts on the subject. Going by the documentation quoted above, it exists to speed up generating tree objects for a new commit and comparing the index against another tree. I tried searching the Building Git book for any references, but it doesn’t seem to mention cache trees at all. Julia Evans, who has been digging deep into git lately, doesn’t have any blog posts about cache trees either.
I also learned about .git/logs
, which provided me concrete evidence of what had gone wrong. Logging is super useful, and I’m glad the main git
tool keeps internal records for moments like these.
I’m tempted to try to add support for Cache Trees in my own implementation, or at least learn to skip past it to find the real checksum if one is present. But trying to match all of Git’s functionality 1 to 1 feels like it could lead to an endless series of yak shaves.
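Skipping past an extension looks fairly approachable. According to the index format documentation, each extension starts with a 4-byte signature followed by a 32-bit size, and the hexdump above fits that: TREE is followed by 00 00 01 2e, and 0x12e (302) bytes after that size field the checksum begins. Here is a rough sketch of what skipping could look like. It is untested against Jit itself; skip_extensions is a hypothetical name, and it assumes an IO that responds to size and pos (File and StringIO both do) and that is already positioned just after the last entry.

```ruby
require 'digest'
require 'stringio'

TRAILER_SIZE          = 20 # the SHA-1 checksum at the end of the index
EXTENSION_HEADER_SIZE = 8  # 4-byte signature + 32-bit big-endian size

# Consumes every extension between the current position and the trailing
# checksum, feeding the bytes to `digest` so the checksum still verifies.
# Returns the signatures that were skipped. (Hypothetical helper.)
def skip_extensions(io, digest)
  signatures = []
  while io.size - io.pos > TRAILER_SIZE
    header = io.read(EXTENSION_HEADER_SIZE)
    unless header && header.bytesize == EXTENSION_HEADER_SIZE
      raise EOFError, 'truncated extension header'
    end

    signature, size = header.unpack('a4N')
    body = io.read(size)
    raise EOFError, 'truncated extension body' unless body && body.bytesize == size

    digest.update(header)
    digest.update(body)
    signatures << signature
  end
  signatures
end
```

In load this would sit between read_entries and verify_checksum, though the Checksum class would need to expose the file and digest (or grow a similar method of its own) for that to work.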
All in all I’m left with greater respect for the amount of engineering that goes into a system like Git. There are so many edge cases and small details that have been built up over the years, and trying to build my own version has given me perspective on just how much work has gone into this ubiquitous tool (though I’ll try to remember to avoid git
commands in my Jit repo).