Handling of git repositories

Hello,

I posted this earlier in the German forum before I read about the “current situation” and that the forum accounts of the actual seafile developers have been closed. So I am posting it here again:

I plan to use the 3-person Pro v6 of seafile once it’s out (hopefully in two weeks). I want to
use it primarily for my git repositories. I read that in past seafile
versions this could have caused errors. What is (or will be in v6) the handling of git repositories?

  1. Is the .git folder just ignored (actually I’d like to also back up my history, if possible)?
  2. Is it backed up but without history (like a simple copy to a network drive)?
  3. Is it handled like any other file, i.e. is there a history of a history (a seafile history of the local git history)?

thanks in advance

Hi,

Seafile handles the .git directory just like any other folder. Seafile borrows ideas from Git but doesn’t use git internally.

Hi! I resurrect this old thread because I always end up here every time I search for a solution to my problem.

It happens to me that Seafile gets very confused by git repositories. If I keep a git repository in a Seafile directory, when I do a git commit, Seafile often fails to sync the git internal data properly. I don’t know the details of what happens under the hood, but the impression is that Seafile may be confused by the multitude of filesystem operations performed by git in rapid sequence. Perhaps a race condition of some sort?

This problem has persisted, for me, since I started using Seafile many years ago. It is essentially the only source of frustration that I have with this software. By the way, let me take a moment to express my gratitude to the developers for so many years of reliable service. I understand that dealing with filesystems cross-platform is complicated: please don’t take this message as an expression of criticism.

Is it a known issue?

I encounter the same problems. Two or three times, seafile messed up my git folder. I ended up excluding git folders from syncing (which is a pain when you work on a git project on multiple machines)…


Now I realize that this is not a server problem, so I am in the wrong forum. Sorry!

Indeed, it is more likely an issue related to the linux client. Taking a superficial look at the code, however, I couldn’t find the bug. (I suspected a couple of lines in wt-monitor-linux.c, but “fixing” them didn’t solve anything.)

A rather crude workaround that, for me, seems to update git repositories reliably is doing

seaf-cli stop
find /path/to/repository.bare -exec touch {} \;
seaf-cli start

after pushing to repository.bare. (This runs touch on all files in the repository.)

I use git a little, but not much, and I’ve never seen this. I worry that I might see it in the future, so I’d like to ask for some details so I can try to reproduce it.

Are you using the seafile agent, or seadrive? Is this on linux, mac, or windows? Does it eventually sort itself out (maybe after a few minutes), or do you need to do the above to clean it up?

Also, what specifically happens? Something like the server ends up missing files from the .git directory, or it has old versions but thinks it is current?

Hi! I am on linux (Debian Trixie); the filesystem is zfs, but I think the problem also occurred on ext4 (I can’t remember for sure though). I use the seafile client provided by Debian, but I have also tried compiling from source. The problem is in all probability either in my specific setup or in seafile-daemon: probably not in the server.

To me, it happens that seafile occasionally misses files: usually the blobs in .git/objects. The only way I found to force seafile to upload the missing files is through the commands above. In particular, it is not sufficient to wait, nor to do just “seaf-cli stop” then “seaf-cli start”. It is hard for me to reproduce the problem, because it happens sporadically. Yet, when it happens, it does not sort itself out.

The linux seafile client spawns seafile-daemon which uses inotify to monitor directories for changes. The inotify mechanism is treacherous: I couldn’t figure out exactly how those files end up being forgotten.

Thanks for the details.

I’m on CachyOS (arch) on ZFS myself, with the server on Debian. Your description sounds like I could have had the problem and just didn’t realize it, so I decided to spend some time trying to reproduce this issue.

So here are the steps I went through to reproduce it. First, I am using the latest versions of the seafile client and seadrive as AppImages. I made a test repo and synced it to a directory on my computer with the seafile agent. In that sync directory, I created a new git repo (git init). I made some files, checked them in, waited for the seafile client to say it was done uploading, and then checked through SeaDrive to see if the server knew the files were there. The first few times everything worked without problem.

The first time I reproduced the issue went like this: I created 1024 1 MB files, checked them in, and verified the sync:

 mkdir a
 cd a
 for a in {1..1024} ; do dd if=/dev/urandom of=test_$a bs=1M count=1 & done
 cd ..
 git add a
 git commit

Wait for the sync to finish, then compare files:

 diff <(find -type f | sort) <(cd ~/SeaDrive/My\ Libraries/test/git ; find -type f | sort )
 < ./.git/objects/63/571b36282334585de85a2b710ca6d7a54449a6
 find -newer .git/objects/63/571b36282334585de85a2b710ca6d7a54449a6

There’s a file in .git/objects in the source directory that wasn’t synced to the server. This file is the most recently created file in the git directory. It looks like somehow the very last file got lost.

A couple of other changes synced fine. So I tried another run with more files:

 mkdir c
 cd c
 for a in {1..2048} ; do dd if=/dev/urandom of=test_$a bs=1M count=1 & done
 cd ..
 git add c
 git commit

Wait for sync to complete. Then:

 diff <(find -type f | sort) <(cd ~/SeaDrive/My\ Libraries/test/git ; find -type f | sort ) | wc -l
 ...
 < ./.git/objects/ff/6ac0d85eacb6e7db70927b2f71e41c612e3108
 6691,6692d4641
 < ./.git/objects/ff/d9f48e16dbd067a65a1cd04026909a18f777a7
 < ./.git/objects/ff/edfd8a3d359db63e441630a822ae6c28e2afd3

This time 2051 files are missing. I think this suggests that the list of changed files that need to be synced (either from inotify, or something internal) can be overrun and some of it lost. There’s nothing in the agent logs about a problem.

In the seafile client UI, if I right-click the library and click “sync now”, that causes it to figure out what’s missing and catch back up.

So I suspect that I’ve never noticed this before because I don’t have any large git projects. Including files in .git, my largest is about 800 files.

I was going to file a bug report on github, but there is one already. For reference here it is: `seaf-cli` not syncing all files in `git` repository · Issue #2833 · haiwen/seafile · GitHub

Hi,

Git creates metadata files in .git using hard links. On Linux, when a hard link is created using the link() system call or the ln command, inotify triggers two different types of events, because this process both creates a new directory entry and modifies the file’s metadata.

Seafile client indexes files only after detecting the IN_CLOSE_WRITE event via inotify. However, files created via hard links generate only an IN_CREATE event and do not produce an IN_CLOSE_WRITE event, which causes such files not to be indexed. If the file were processed at the IN_CREATE stage, it could be indexed while it is still being written, leading to issues.

At present, we do not have plans to support this edge case.
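The explanation above can be checked with a few lines of C against the raw inotify API. The following is a minimal sketch (assuming Linux; the temp directory and file names are made up for the demo, and it is not Seafile code): the file created by writing receives both IN_CREATE and IN_CLOSE_WRITE, while the hard-linked one receives only IN_CREATE.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Watch a fresh temp directory, create "written" via open()/write()/close()
 * and "linked" via link(), then record every event as a "name:EVENT" line. */
static void run_demo(char *out, size_t cap)
{
    char dir[] = "/tmp/inotify-demo-XXXXXX";
    if (!mkdtemp(dir)) { perror("mkdtemp"); exit(1); }

    int ifd = inotify_init1(0);
    inotify_add_watch(ifd, dir, IN_CREATE | IN_CLOSE_WRITE);

    char written[300], linked[300];
    snprintf(written, sizeof written, "%s/written", dir);
    snprintf(linked, sizeof linked, "%s/linked", dir);

    int fd = open(written, O_CREAT | O_WRONLY, 0644); /* queues IN_CREATE */
    if (write(fd, "x", 1) != 1) exit(1);
    close(fd);                                        /* queues IN_CLOSE_WRITE */
    link(written, linked);                            /* queues IN_CREATE only */

    /* All three events are already queued, so one read() returns them all. */
    char buf[4096] __attribute__((aligned(8)));
    ssize_t n = read(ifd, buf, sizeof buf);
    out[0] = '\0';
    for (char *p = buf; p < buf + n;) {
        struct inotify_event *ev = (struct inotify_event *)p;
        if (ev->mask & IN_CREATE) {
            strncat(out, ev->name, cap - strlen(out) - 1);
            strncat(out, ":IN_CREATE\n", cap - strlen(out) - 1);
        }
        if (ev->mask & IN_CLOSE_WRITE) {
            strncat(out, ev->name, cap - strlen(out) - 1);
            strncat(out, ":IN_CLOSE_WRITE\n", cap - strlen(out) - 1);
        }
        p += sizeof *ev + ev->len;
    }
    close(ifd);
}
```

Printing the buffer filled by run_demo should show IN_CREATE and IN_CLOSE_WRITE for “written”, but only IN_CREATE for “linked”: no close event ever arrives for the hard link, so a client keyed on IN_CLOSE_WRITE never sees it.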

Thanks @feiniks for unmasking the bug!

I respect the dev team’s decision not to support this case. Yet, please allow me to offer a solution. I have a fix for seafile-daemon here on github: https://github.com/vacaboja/seafile I will also submit a pull request presently.

Plea

Before I go into the details of the fix, let me bother the dev team with a little plea. Of course, since I want compatibility with git, I can use my own fork of seafile-daemon, and I am content with that. Nevertheless, I wish to ask the dev team to reconsider their position on this issue. Messing up users’ git repositories really isn’t a good look, especially considering that many people willing to self-host their own cloud storage are likely to be nerds like myself, and thus likely to use git. Besides, anyone who knows that seafile does not sync git repositories reliably will naturally be wary of the reliability of the system in general. I tried to make fixing the bug as easy as possible: in fact, I believe that you only have to accept the pull request. Of course, I am willing to revise the code to meet any concerns.

Details of the bug

Given the explanation offered by @feiniks, one wonders why, in the tests performed by @tomservo, only occasional files went missing. In fact, if things were exactly as @feiniks describes, seafile-daemon should miss every single object file saved by git, not just a few.

Well, @feiniks is of course right, but there is more to the story. Seafile indexes files on IN_CLOSE_WRITE and not on IN_CREATE; however, there are other inotify events that may trigger indexing. One is IN_MODIFY, which does not concern us here. Another one, however, is IN_CREATE on the containing directory. In fact, in wt-monitor-linux.c the function add_watch_recursive() will cause any file present in a newly created directory to be indexed. This is precisely how most of the files in .git/objects get indexed in @tomservo’s tests: there, git creates the subdirectories inside .git/objects and the files therein in rapid succession, and by the time seafile-daemon gets to process the events, git has already completed most of its work. Understanding this, it becomes clear that a more reliable way to trigger the bug is to create the repository, commit some files to it, wait for seafile to sync, and then finally perform a new commit. The following script reliably triggers the bug for me, causing seafile to miss about 50 of git’s internal files, even though it merely creates 200 1 kB files.

mkdir test
cd test
 
git init
 
mkdir files1
 
for A in `seq 100`
do
        dd if=/dev/urandom of=files1/file$A bs=1024 count=1
done
 
git add files1
git commit -m foo
 
sleep 60
 
mkdir files2
 
for A in `seq 100`
do
        dd if=/dev/urandom of=files2/file$A bs=1024 count=1
done
 
git add files2
git commit -m bar

The sleep 60 command gives seafile some time to index the first commit.

The fix

As correctly pointed out by @feiniks, there is no way to tell whether an IN_CREATE event comes from an open() syscall or a link() syscall. So we need to guess, and index the file when we guess link(). What happens if we guess wrong?

  1. We guess open() and it was a link(). Then we miss a file. This is the bug. This is very bad.
  2. We guess link() and it was an open(). Then we index a possibly incomplete file. This is tolerable as long as we don’t do it too much. Indeed, seafile-daemon already indexes incomplete files on multiple occasions: on IN_MODIFY, and when IN_CREATE fires on the containing directory, for instance. This is not a big deal, because any file being written to will eventually get its own IN_CLOSE_WRITE event and will then be indexed with its final content.

So we need a reasonably reliable guess that never mistakes a link() for an open().

The guess strategy in my code is as follows. When an IN_CREATE event fires on a regular file, we don’t index the file, but we remember the event for a while. After about one second, if the same file has received any event causing it to be either indexed or deleted, or if we detect any change in the file through its last-modification time, we forget the whole thing. If not, then either the file was created by link() or, at least, it is not being actively written to. In that case, the safest bet is to index the file.

The implementation is straightforward. Looking at wt-monitor-linux.c, you see that the one-second timer is created by imposing a timeout on the select() syscall in wt_monitor_job_linux(). The exact timing is not critical, so it does not matter if, under stress, the timer fires less frequently. Let’s say that it fires at seconds n-1, n, n+1, n+2, etc. When second n+2 fires, we produce internal WT_EVENT_CREATE_OR_UPDATE events for all unchanged files created between seconds n and n+1 (so these files are 1 to 2 seconds old). This is implemented with two hash tables, recheck_next and recheck_accu, in the RepoWatchInfo structure. Specifically, at any time, files receiving IN_CREATE events are accumulated in the recheck_accu hash, while the files in recheck_next are those accumulated during the previous second. When the seconds timer fires, we deal with all the files in recheck_next, then we swap the now-empty recheck_next with recheck_accu. Every time we fire a WT_EVENT_CREATE_OR_UPDATE on a file that is in either recheck_accu or recheck_next, we expunge that file.
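A stripped-down sketch may make the double-buffer scheme easier to follow. This is not the actual patch (which works on GLib hash tables inside wt-monitor-linux.c); the fixed-size tables, function names, and paths below are illustrative only:

```c
#include <string.h>

#define MAX_PATHS 64
#define PATH_LEN  256

/* Toy replacement for the GLib hash tables used by the real patch. */
typedef struct { char paths[MAX_PATHS][PATH_LEN]; int n; } PathSet;

static void set_add(PathSet *s, const char *p)
{
    for (int i = 0; i < s->n; i++)
        if (strcmp(s->paths[i], p) == 0) return;   /* already remembered */
    if (s->n < MAX_PATHS)
        strncpy(s->paths[s->n++], p, PATH_LEN - 1);
}

static void set_remove(PathSet *s, const char *p)
{
    for (int i = 0; i < s->n; i++)
        if (strcmp(s->paths[i], p) == 0) {
            s->n--;
            if (i != s->n)
                memcpy(s->paths[i], s->paths[s->n], PATH_LEN);
            return;
        }
}

typedef struct {
    PathSet a, b;
    PathSet *recheck_accu;  /* IN_CREATE events seen this second */
    PathSet *recheck_next;  /* the batch accumulated one second ago */
} RepoWatchInfo;

static void watch_init(RepoWatchInfo *w)
{
    memset(w, 0, sizeof *w);
    w->recheck_accu = &w->a;
    w->recheck_next = &w->b;
}

/* IN_CREATE on a regular file: don't index yet, just remember it. */
static void on_create(RepoWatchInfo *w, const char *path)
{
    set_add(w->recheck_accu, path);
}

/* Any event that indexes or deletes the file: forget it. (The real
 * patch also re-stats the file and drops it if its mtime changed.) */
static void on_indexed_or_deleted(RepoWatchInfo *w, const char *path)
{
    set_remove(w->recheck_accu, path);
    set_remove(w->recheck_next, path);
}

/* select() timeout fires (~1 s): emit WT_EVENT_CREATE_OR_UPDATE for
 * every still-untouched path in recheck_next (now 1-2 s old), then
 * swap the emptied table with recheck_accu. Returns #paths emitted. */
static int on_timer(RepoWatchInfo *w, void (*emit)(const char *path))
{
    int emitted = 0;
    for (int i = 0; i < w->recheck_next->n; i++, emitted++)
        if (emit) emit(w->recheck_next->paths[i]);
    w->recheck_next->n = 0;
    PathSet *tmp = w->recheck_next;
    w->recheck_next = w->recheck_accu;
    w->recheck_accu = tmp;
    return emitted;
}
```

The swap means a file remembered on IN_CREATE sits through one full timer tick before being indexed, which is exactly the 1-to-2-second quiet period described above: a file that received IN_CLOSE_WRITE (or was deleted) in the meantime has already been expunged, so only the link()-style creations survive to be indexed.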

Thanks for reading to the end of this.

And thanks to the dev team for their work maintaining seafile.


Hi everyone,

I’ve been using my tentative fix for seaf-daemon for a couple of weeks with no issues.

For anyone interested, I made a quick Debian package for x86-64 with installation instructions on my GitHub repo. I’d love to hear feedback if anyone else gives it a try.

Use at your own risk.

Thanks for your detailed analysis and PR. Since syncing git repositories with Seafile is a rare use case, we don’t plan to fix this bug. Fixing it may affect the handling of more common cases.

Thank you @Jonathan for the reply. I understand and respect the developer team’s position on this issue.

For those of us who do rely on Seafile to sync git repositories, could you please clarify which more common use cases would be negatively affected by fixing this bug? It would help us better understand the trade-offs involved.

Copying a large file into a synced folder may be affected by this change: indexing a file while it is still being copied will fail. I remember we fixed such bugs many years ago. This case is explained in a comment in the code (wt-monitor-linux.c):

        /* Nautilus's file copy operation doesn't trigger write events.
         * If the user copy a large file into the repo, only a create
         * event and a close_write event will be received. If we process
         * the create event, we'll certainly try to index a file when it's
         * still being copied. So we'll ignore create event for files.
         * Since write and close_write events will always be triggered,
         * we don't need to worry about missing this file.
         */

We don’t want to change to a more complex logic, as different kernels or desktop environments may have different behavior with regard to writing files. The current logic works for most cases (no one complains about it in more general cases), so we want to keep it.

Thank you @Jonathan for the explanation. I understand and respect the desire to avoid changes in the client’s algorithm.

For my use case, I’m willing to accept that risk. For the time being, I’ll continue to maintain my fork for personal use and for anyone else who may find it useful.

Thanks again for taking the time to clarify the rationale.