Index Size Limits / Directory

Hi,

I’m on the Pro 6.1.4 version.
I wonder how the indexing process is managed from Seafile to Elasticsearch.
My index is larger than 2.5 GB, for 242 libraries and 163,307 files.
I have a lot of PDF files, which are indexed by default.

I’d like to know:

  • Which files are indexed (only the latest version in the history? not the deleted ones?)
  • Whether there is a way to limit the index scope, such as:
      • the number of pages indexed, as in the preview process
      • stop words, or any built-in Elasticsearch feature that could be set in the Seafile settings
  • Whether there is a way to store the index files in a directory other than Seafile’s, whose size is limited

Regards,

Gautier

Hi,

I ran a clear + rebuild on the index; it saved about 10% of the size.
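Concretely, that was something along the lines of the standard Pro-edition search commands (run from the seafile-pro-server directory; paths assume the default layout):

```shell
# Stop the server first so the background indexer does not interfere,
# then clear and rebuild the search index.
./seafile.sh stop
./pro/pro.py search --clear     # drop the existing index
./pro/pro.py search --update    # rebuild it from the current libraries
./seafile.sh start
```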

But @daniel.pan, I noticed another potential disk-space problem with the thumbnails:

seahub-data/thumbnail

which is 1.2 GB.

I think this folder could be cleaned from time to time to save disk space.

Also, the “local” data folders could be located apart from the application:

seahub_data/
pro-data/

Is there a config file to do so?

I tried to edit seahub_settings.py and change THUMBNAIL_ROOT:

FILE_PREVIEW_MAX_SIZE = 30 * 1024 * 1024
THUMBNAIL_ROOT = '/nfs/seafile-thumbnail/thumb'

I restarted Seafile.
But the new thumbnail folder stays empty, whereas the old folder is still active (/thumb keeps filling up).

Is there any conflict between:

ENABLE_THUMBNAIL = FALSE
THUMBNAIL_ROOT = '/nfs/seafile-thumbnail/thumb'

Why is /thumb filling up while ENABLE_THUMBNAIL = FALSE?

Regards,

Gautier

There is no such configuration option. But you can easily move the folders to another place and create a symbolic link in the original place.
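As a sketch, with the default paths seen in this thread (/bigdisk is an illustrative mount point, adjust to your layout):

```shell
# Move seahub-data to a larger volume and leave a symlink behind,
# so Seafile keeps finding it at the original path.
cd /home/cc/seafile
./seafile.sh stop
mv seahub-data /bigdisk/seahub-data
ln -s /bigdisk/seahub-data seahub-data
./seafile.sh start
```

The same pattern works for pro-data/.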

I will look into the other problems later.

Thank you, Daniel. I did so.

Regards,

Gautier

Hi,

The Seafile index is 20 GB for about 500 GB of libraries.
The VM disk is quite full.

I wonder :

  • if 20 GB is normal
  • if it is possible to locate the index on an NFS share (a potential performance issue) and symlink to it, as you mentioned
  • if we could limit the index size with new global options like index_office_pdf = true
  • if we could add a new per-library option (“do not index this library”)

Regards

20 GB is a little too much for 500 GB of libraries. Maybe you can try rebuilding the index?

Yes, it is possible.

You can turn off index_office_pdf to reduce the index size a lot. But it can’t be turned off per library.
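For reference, that switch lives in the [INDEX FILES] section of seafevents.conf (a sketch; the interval value here is illustrative):

```ini
[INDEX FILES]
enabled = true
interval = 10m
# Turn off full-text extraction of office/PDF contents to shrink the index;
# file names are still indexed.
index_office_pdf = false
```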

Hi,

I sized the disk up to 50 GB, but the index took all the available space.
I cleared the index, but before rebuilding it, I’d like to understand what happened.

The index.log file says that there were two concurrent processes.
They run with root privileges, maybe because I started Seafile as a service:

sudo service seafile-server start

02/06/2018 02:39:01 [INFO] root:194 main: storage: using filesystem storage backend
02/06/2018 02:39:01 [INFO] root:196 main: index office pdf: True
02/06/2018 02:39:01 [ERROR] seafes:248 check_concurrent_update: another index task is running, quit now
02/06/2018 02:44:31 [ERROR] seafes:103 thread_task: Error when index repo 0daaf670-ba81-4161-a1a0-7916ce993980
Traceback (most recent call last):
  File "/home/cc/seafile/seafile-pro-server-6.2.8/pro/python/seafes/index_local.py", line 98, in thread_task
    self.fileindexupdater.update_repo(repo_id, commit_id)
  File "/home/cc/seafile/seafile-pro-server-6.2.8/pro/python/seafes/file_index_updater.py", line 78, in update_repo
    self.check_recovery(repo_id)
  File "/home/cc/seafile/seafile-pro-server-6.2.8/pro/python/seafes/file_index_updater.py", line 74, in check_recovery
    self.update_files_index(repo_id, old, new)
  File "/home/cc/seafile/seafile-pro-server-6.2.8/pro/python/seafes/file_index_updater.py", line 63, in update_files_index
    self.files_index.delete_files(repo_id, deleted_files)
  File "/home/cc/seafile/seafile-pro-server-6.2.8/pro/python/seafes/indexes/repo_files.py", line 212, in delete_files
    self.bulk(actions, ignore_not_found=True)
  File "/home/cc/seafile/seafile-pro-server-6.2.8/p
02/06/2018 10:09:04 [INFO] root:194 main: storage: using filesystem storage backend
02/06/2018 10:09:04 [INFO] root:196 main: index office pdf: True
02/06/2018 10:09:04 [INFO] seafes:147 start_index_local: Index process initialized.
02/06/2018 10:09:04 [INFO] seafes:52 run: starting worker0 worker threads for indexing
02/06/2018 10:09:04 [INFO] seafes:52 run: starting worker1 worker threads for indexing
02/06/2018 10:09:06 [INFO] seafes:82 update_repo: Updating repo 022257ce-5b1f-487f-a8c2-6af5ab5c6c02
02/06/2018 10:09:06 [INFO] seafes:82 update_repo: Updating repo 03320930-691d-45a3-bb43-4122d07fb69b
02/06/2018 10:09:58 [INFO] seafes:82 update_repo: Updating repo 0452db72-44d1-4766-a2e4-d316315f1861
02/06/2018 10:10:00 [INFO] seafes:82 update_repo: Updating repo 0482f3fe-2624-4611-bca9-1108d4dd1e87
02/06/2018 10:10:08 [INFO] seafes:82 update_repo: Updating repo 04d43a9b-e73c-400e-9505-21bf3d33f9fd
02/06/2018 10:10:11 [INFO] seafes:82 update_repo: Updating repo 06ef276e-550c-45de-bb0d-a66a6f3aa180
02/06/2018 10:10:11 [INFO] seafes:82 update_repo: Updating repo 085a5ab9-25bb-486e-8a6b-42e9ad27db91
02/06/2018 10:10:11 [INFO] seafes:82 update_repo: Updating repo 0942c690-cbc5-49c8-a7fa-2efaadff71ef
02/06/2018 10:12:48 [INFO] seafes:82 update_repo: Updating repo 09b5f980-3a51-4451-937f-58a72ac999a5
02/06/2018 10:12:48 [INFO] seafes:82 update_repo: Updating repo 0daaf670-ba81-4161-a1a0-7916ce993980

Then elasticsearch.log reported that the disk was full (“Aucun espace disponible sur le périphérique” = “No space left on device”):

[2018-02-06 03:25:18,861][WARN ][cluster.action.shard     ] [Balor] [repofiles][4] received shard failed for target shard [[repofiles][4], node[ThrCsp6SQGOvObChZSA9bw], [P], v[26501], s[INITIALIZING], a[id=8swGVD63QT2rc_cO55IYag], unassigned_info[[reason=ALLOCATION_FAILED], at[2018-02-06T02:25:18.490Z], details[failed recovery, failure IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: FileSystemException[/home/cc/seafile/pro-data/search/data/elasticsearch/nodes/0/indices/repofiles/4/translog/translog.ckp -> /home/cc/seafile/pro-data/search/data/elasticsearch/nodes/0/indices/repofiles/4/translog/translog-1820424797732853325.tlog: Aucun espace disponible sur le périphérique]; ]]], indexUUID [khvIF-BERvG4irA6ZEFd3g], message [master {Balor}{ThrCsp6SQGOvObChZSA9bw}{local}{local[1]}{local=true} marked shard as initializing, but shard is marked as failed, resend shard failure]
[2018-02-06 03:25:19,004][WARN ][index.engine             ] [Balor] [repofiles][3] failed engine [delete]
java.io.IOException: Aucun espace disponible sur le périphérique
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:205)
        at org.elasticsearch.common.io.Channels.writeToChannel(Channels.java:208)
        at org.elasticsearch.index.translog.BufferingTranslogWriter.flush(BufferingTranslogWriter.java:84)
        at org.elasticsearch.index.translog.BufferingTranslogWriter.add(BufferingTranslogWriter.java:66)
        at org.elasticsearch.index.translog.Translog.add(Translog.java:545)
        at org.elasticsearch.index.engine.InternalEngine.innerDelete(InternalEngine.java:613)
        at org.elasticsearch.index.engine.InternalEngine.delete(InternalEngine.java:555)
        at org.elasticsearch.index.shard.TranslogRecoveryPerformer.performRecoveryOperation(TranslogRecoveryPerformer.java:202)
        at org.elasticsearch.index.shard.TranslogRecoveryPerformer.recoveryFromSnapshot(TranslogRecoveryPerformer.java:107)
        at org.elasticsearch.index.shard.IndexShard$1.recoveryFromSnapshot(IndexShard.java:1582)
        at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:235)
        at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:171)
[2018-02-06 09:45:06,932][WARN ][cluster.action.shard     ] [Balor] [repofiles][4] received shard failed for target shard [[repofiles][4], node[ThrCsp6SQGOvObChZSA9bw], [P], v[446776], s[INITIALIZING], a[id=GdxtTa6nQ5KZqlXi53Comw], unassigned_info[[reason=ALLOCATION_FAILED], at[2018-02-06T08:45:06.883Z], details[failed recovery, failure IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: FileSystemException[/home/cc/seafile/pro-data/search/data/elasticsearch/nodes/0/indices/repofiles/4/translog/translog.ckp -> /home/cc/seafile/pro-data/search/data/elasticsearch/nodes/0/indices/repofiles/4/translog/translog-8482320269078323280.tlog: Aucun espace disponible sur le périphérique]; ]]], indexUUID [khvIF-BERvG4irA6ZEFd3g], message [master {Balor}{ThrCsp6SQGOvObChZSA9bw}{local}{local[1]}{local=true} marked shard as initializing, but shard is marked as failed, resend shard failure]

After clearing the index, things seem better…

[2018-02-06 10:32:03,523][INFO ][node                     ] [Balor] stopping ...
[2018-02-06 10:32:04,080][INFO ][node                     ] [Balor] stopped
[2018-02-06 10:32:04,080][INFO ][node                     ] [Balor] closing ...
[2018-02-06 10:32:04,089][INFO ][node                     ] [Balor] closed
[2018-02-06 10:33:58,998][INFO ][node                     ] [Arcade] version[2.4.5], pid[27701], build[c849dd1/2017-04-24T16:18:17Z]
[2018-02-06 10:33:58,999][INFO ][node                     ] [Arcade] initializing ...
[2018-02-06 10:33:59,655][INFO ][plugins                  ] [Arcade] modules [lang-groovy, reindex, lang-expression], plugins [analysis-ik], sites []
[2018-02-06 10:33:59,713][INFO ][env                      ] [Arcade] using [1] data paths, mounts [[/ (rootfs)]], net usable_space [48.7gb], net total_space [55.6gb], spins? [unknown], types [rootfs]
[2018-02-06 10:33:59,713][INFO ][env                      ] [Arcade] heap size [1007.3mb], compressed ordinary object pointers [true]
[2018-02-06 10:34:01,006][INFO ][ik-analyzer              ] try load config from /home/cc/seafile/seafile-pro-server-6.2.8/pro/elasticsearch/config/analysis-ik/IKAnalyzer.cfg.xml
[2018-02-06 10:34:01,346][INFO ][ik-analyzer              ] [Dict Loading] custom/mydict.dic
[2018-02-06 10:34:01,347][INFO ][ik-analyzer              ] [Dict Loading] custom/single_word_low_freq.dic
[2018-02-06 10:34:01,350][INFO ][ik-analyzer              ] [Dict Loading] custom/ext_stopword.dic
[2018-02-06 10:34:01,796][INFO ][node                     ] [Arcade] initialized
[2018-02-06 10:34:01,796][INFO ][node                     ] [Arcade] starting ...
[2018-02-06 10:34:01,799][INFO ][transport                ] [Arcade] publish_address {local[1]}, bound_addresses {local[1]}
[2018-02-06 10:34:01,802][INFO ][discovery                ] [Arcade] elasticsearch/jdlnIV-HT3KkQTZikWtPXg
[2018-02-06 10:34:01,810][INFO ][cluster.service          ] [Arcade] new_master {Arcade}{jdlnIV-HT3KkQTZikWtPXg}{local}{local[1]}{local=true}, reason: local-disco-initial_connect(master)
[2018-02-06 10:34:01,905][INFO ][http                     ] [Arcade] publish_address {127.0.0.1:9200}, bound_addresses {127.0.0.1:9200}
[2018-02-06 10:34:01,906][INFO ][node                     ] [Arcade] started
[2018-02-06 10:34:01,927][INFO ][gateway                  ] [Arcade] recovered [2] indices into cluster_state
[2018-02-06 10:34:02,963][INFO ][cluster.routing.allocation] [Arcade] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[repo_head][2]] ...]).

But when I look at the indexing process, I find new errors:

02/06/2018 10:32:03 [ERROR] seafes:103 thread_task: Error when index repo fd5aea19-4443-4c4d-9261-ea96da1bdfd3
Traceback (most recent call last):
  File "/home/cc/seafile/seafile-pro-server-6.2.8/pro/python/seafes/index_local.py", line 98, in thread_task
    self.fileindexupdater.update_repo(repo_id, commit_id)
  File "/home/cc/seafile/seafile-pro-server-6.2.8/pro/python/seafes/file_index_updater.py", line 78, in update_repo
    self.check_recovery(repo_id)
  File "/home/cc/seafile/seafile-pro-server-6.2.8/pro/python/seafes/file_index_updater.py", line 69, in check_recovery
    status = self.status_index.get_repo_status(repo_id)
  File "/home/cc/seafile/seafile-pro-server-6.2.8/pro/python/seafes/indexes/repo_status.py", line 76, in get_repo_status
    doc = self.es.get(index=self.INDEX_NAME, doc_type=self.MAPPING_TYPE, id=repo_id)
  File "/home/cc/seafile/seafile-pro-server-6.2.8/pro/python/elasticsearch-2.4.1-py2.6.egg/elasticsearch/client/utils.py", line 69, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/home/cc/seafile/seafile-pro-server-6.2.8/pro/python/elasticsearch-2.4.1-py2.6.egg/elasticsearch/client/__init__.py", line 341, in get
    doc_type, id), params=params)
  File "/home/cc/seafile/seafile-pro-server-6.2.8/pro/python/elasticsearch-2.4.1-py2.6.egg/elasticsearch/transport.py", line 327, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/home/cc/seafile/seafile-pro-server-6.2.8/pro/python/elasticsearch-2.4.1-py2.6.egg/elasticsearch/connection/http_urllib3.py", line 106, in perform_request
    raise ConnectionError('N/A', str(e), e)
ConnectionError: ConnectionError(<urllib3.connection.HTTPConnection object at 0x7f7f2ac51a10>: Failed to establish a new connection: [Errno 111] Connection refused) caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x7f7f2ac51a10>: Failed to establish a new connection: [Errno 111] Connection refused)

It seems that Elasticsearch refuses the connection.
But the Elasticsearch process looks fine:

 ps -ef | grep elasticsearch
cc        1674 14674  0 10:55 pts/0    00:00:00 grep elasticsearch
cc       27701 27694 13 10:33 ?        00:02:55 /usr/bin/java -Xms256m -Xmx1g -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC -Dfile.encoding=UTF-8 -Djna.nosys=true -Des.path.home=/home/cc/seafile/seafile-pro-server-6.2.8/pro/elasticsearch -cp /home/cc/seafile/seafile-pro-server-6.2.8/pro/elasticsearch/lib/elasticsearch-2.4.5.jar:/home/cc/seafile/seafile-pro-server-6.2.8/pro/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch start -Des.path.logs=/home/cc/seafile/logs -Des.path.data=/home/cc/seafile/pro-data/search/data -Des.network.host=127.0.0.1 -Des.insecure.allow.root=true -p /home/cc/seafile/pids/elasticsearch.pid

Elasticsearch is listening on port 9200:

netstat -tulpn | grep 9200
tcp        0      0 127.0.0.1:9200          0.0.0.0:*               LISTEN      27701/java
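To double-check that the node actually answers HTTP requests (and to see the cluster state), a plain curl against the standard Elasticsearch health endpoint can help:

```shell
# Ask the local Elasticsearch node for its cluster health.
# The status field (green/yellow/red) shows whether shards are allocated.
curl -s 'http://127.0.0.1:9200/_cluster/health?pretty'
```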

If I rebuild the index, how can I be sure that it won’t fill the disk again?
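One safeguard worth looking at (my own suggestion, not something Seafile configures for you) is Elasticsearch’s disk-allocation watermarks, which stop shard allocation before the disk is completely full. On the bundled 2.x they can be set at runtime via the cluster settings API:

```shell
# Tell Elasticsearch to stop allocating shards on this node once disk usage
# passes the high watermark, instead of writing until "no space left on device".
curl -s -XPUT 'http://127.0.0.1:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.disk.threshold_enabled": true,
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%"
  }
}'
```

On a single-node setup this only limits new shard allocation, so monitoring free space is still needed.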

Likewise, the normal indexing process is broken.

Could you have a look, @daniel.pan ?

I’m on the Pro 6.2.8 version now.

[UPDATE]

When I launch the index update, I get an error:

 ./pro/pro.py search --update

Updating search index, this may take a while...

02/06/2018 14:07:05 [INFO] root:194 main: storage: using filesystem storage backend
02/06/2018 14:07:05 [INFO] root:196 main: index office pdf: True
02/06/2018 14:07:05 [ERROR] seafes:248 check_concurrent_update: another index task is running, quit now

The active processes are:

ps -ef | grep seafes
cc        5817  5277  0 13:59 ?        00:00:00 /bin/sh -c "/usr/bin/python2.7" "-m" "seafes.index_local" "--logfile" "/home/cc/seafile/logs/index.log" "update"
cc        5818  5817 25 13:59 ?        00:01:54 /usr/bin/python2.7 -m seafes.index_local --logfile /home/cc/seafile/logs/index.log update
cc       15201  5818  0 14:07 ?        00:00:00 timeout 300 java -Dfile.encoding=UTF-8 -jar /home/cc/seafile/seafile-pro-server-6.2.8/pro/python/seafes/poi/ExtractText.jar
cc       15202 15201  0 14:07 ?        00:00:00 java -Dfile.encoding=UTF-8 -jar /home/cc/seafile/seafile-pro-server-6.2.8/pro/python/seafes/poi/ExtractText.jar
cc       15219  3840  0 14:07 pts/0    00:00:00 grep seafes

Regards

To let you know:

I have another server running the same version, on which the problem does not occur.

Do you have seafevents running?
The seafevents background process may run the index program at intervals.
For data-accuracy reasons, two index processes cannot run at the same time.
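A minimal guard for manual runs, assuming the indexer module name seen earlier in this thread (seafes.index_local):

```shell
# Only start a manual index update if no seafes indexer is already running,
# mirroring the check_concurrent_update error seen in index.log.
if pgrep -f 'seafes.index_local' >/dev/null; then
    echo 'another index task is running, try again later'
else
    ./pro/pro.py search --update
fi
```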

I hope you don’t use NFS.

Reason:
https://www.elastic.co/guide/en/elasticsearch/guide/2.x/indexing-performance.html#_storage

  1. The index is built from the files that currently exist on your server. Deleted files are automatically removed from the index.
  2. The index scope can’t be flexibly limited for now.
  3. @pan has given the answer.

Hi,

Thank you for your answers.
After quite a panicky moment, I have rebuilt the index successfully.

Maybe this error was caused by the fact that I updated the index very shortly after clearing it.

Sorry, but there is a contradiction between @daniel.pan’s answer and the Elasticsearch docs about locating the index on an NFS mount:

Do not use remote-mounted storage, such as NFS or SMB/CIFS

I’ll follow the docs…

Maybe the better solution is to have a separate Elasticsearch server. If it crashes, it crashes alone…

Regards

You can mount a new SSD to a directory, then save the data to that directory.
