It turns out that the issue seems to occur when there is no diffstat for a given commit and the last line read is appended back to the list:
    if not l.startswith(" "):
        # If there is no starting space, it means there were no stats rows,
        # and we're already looking at the next commit. Put this line back
        # on the list and move on
        lines.append(l)
        break
This ends up adding a regular str (and not a byte string) to the list, which blows up later when we try to do a 'decode' on it, in the next pass.
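A minimal sketch of that failure, assuming Python 3 and a list of raw byte lines. The names `lines` and `l` mirror the snippet above; the sample data is made up for illustration and is not taken from the real hook.

```python
# Hypothetical stand-in for the hook's list of raw output lines (bytes).
lines = [b" file.c | 2 +-", b"commit abc123"]

# Simulate the push-back described above: the line is decoded to str,
# then appended to a list that otherwise holds only bytes.
l = lines.pop().decode("utf-8", errors="ignore")
lines.append(l)

# On the next pass the same decode is attempted on the str.
try:
    lines.pop().decode("utf-8", errors="ignore")
    failed = False
except AttributeError:
    # In Python 3, str has no decode() method.
    failed = True
```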
Instead, I believe this should be:
    if not l.startswith(" "):
        # If there is no starting space, it means there were no stats rows,
        # and we're already looking at the next commit. Put this line back
        # on the list and move on
        lines.append(l.encode('utf-8'))
        break
Local testing showed that this fixes it, but I'd like to give Magnus an opportunity to review and confirm that this fix makes sense before changing things on the git server.
Nice spot. I think you have at least found the issue, but I think it may not be the best fix. Given that we decode the string with errors=ignore, re-encoding it might lose data. Does the attached patch fix it in your tests as well? Instead of decoding and re-encoding, it just sticks the original line back on the list (which also matches the comment).
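The attached patch is not reproduced here, so the following is only a sketch of the idea under stated assumptions: Python 3, a list of raw byte lines consumed from the end, and hypothetical helper names (`decode_line`, `consume_stats`) that do not come from the real hook. It shows why the decode/encode round trip can be lossy, and how keeping the raw bytes around avoids that.

```python
def decode_line(raw: bytes) -> str:
    """Decode a raw line the way the thread describes: invalid bytes are dropped."""
    return raw.decode("utf-8", errors="ignore")


# A line containing a byte that is not valid UTF-8 (0xE9 is Latin-1 'e-acute').
raw_commit = b"commit caf\xe9 fix"
roundtrip = decode_line(raw_commit).encode("utf-8")
lossy = roundtrip != raw_commit  # the invalid byte was silently discarded


def consume_stats(lines):
    """Pop stat rows off `lines`; push back the first non-stat line as raw bytes.

    Hypothetical loop shape, assumed for illustration only.
    """
    stats = []
    while lines:
        raw = lines.pop()
        l = decode_line(raw)
        if not l.startswith(" "):
            # Not a stats row: we are already looking at the next commit.
            # Put the original bytes back, not the decoded str.
            lines.append(raw)
            break
        stats.append(l)
    return stats


pending = [raw_commit, b" file.c | 2 +-"]
stats = consume_stats(pending)
```

With this shape the list only ever holds bytes, so the later decode cannot hit a str, and the pushed-back line is byte-for-byte what was read.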