[BUG] PDFs are not indexed in the pro version

Hi,

I just installed the pro version of the server (6.2.2) and noticed
that PDFs are not indexed. I traced the problem back to
pro/python/seafes/extract.py, lines 34 and 60. I changed them to:

cmd = ['timeout', str(seafes_config.content_extract_time*60), 'java', '-Dfile.encoding=UTF-8', '-jar', jarfile]

and

cmd = ['timeout', str(seafes_config.content_extract_time*60), 'pdftotext', pdf_name, txt_name]

It seems to work now.
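
The reason the change works: `timeout` takes its duration in seconds, passed as a string, so the value multiplied by 60 has to be a number read from the config object. The following is a minimal sketch and not the actual seafes code; `FakeConfig` and `build_pdftotext_cmd` are hypothetical names, and reading `content_extract_time` as a number of minutes is an assumption based on the `*60`.

```python
class FakeConfig(object):
    """Hypothetical stand-in for the real seafes config object."""
    content_extract_time = 5  # assumed to be minutes


def build_pdftotext_cmd(config, pdf_name, txt_name):
    # timeout(1) expects the duration in seconds, passed as a string.
    seconds = config.content_extract_time * 60
    return ['timeout', str(seconds), 'pdftotext', pdf_name, txt_name]


cmd = build_pdftotext_cmd(FakeConfig(), 'in.pdf', 'out.txt')
print(cmd)
```

The resulting list can be passed as-is to `subprocess.call`; GNU `timeout` exits with status 124 when the time limit is hit.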

Hi,

I have the same issue.

On my newly upgraded pro 6.2 server I have:

l. 43

cmd = ['timeout', str(seafes_config*60), 'java', '-Dfile.encoding=UTF-8', '-jar', jarfile]

and

l. 59

cmd = ['timeout', str(seafes_config*60), 'pdftotext', pdf_name, txt_name]

If I rebuild the index, I get a lot of errors like this one:

[12/19/2017 16:16:28] error when extracting pdf: unsupported operand type(s) for *: 'SeafesConfig' and 'int'

The error appears even for PDF files (which are extracted anyway).

But I do not have any PDF file indexed!

Does it look like a bug, @daniel.pan @Jonathan?
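
The error message points at the cause: the whole config object, rather than a numeric attribute of it, is multiplied by 60. A minimal sketch that reproduces the same `TypeError` in isolation; the `SeafesConfig` class here is a hypothetical stand-in, not the real seafes class.

```python
class SeafesConfig(object):
    """Hypothetical stand-in; the real class lives in seafes."""
    content_extract_time = 5


seafes_config = SeafesConfig()

try:
    # Same shape as the quoted lines: the object itself is multiplied.
    str(seafes_config * 60)
except TypeError as exc:
    message = str(exc)

print(message)

# Multiplying the numeric attribute instead works.
timeout_seconds = str(seafes_config.content_extract_time * 60)
print(timeout_seconds)
```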

Here is the index.log file:

Index updated, statistic report:

[12/19/2017 16:06:45] [commit read] 0
[12/19/2017 16:06:45] [dir read]    0
[12/19/2017 16:06:45] [file read]   0
[12/19/2017 16:06:45] [block read]  0
[12/19/2017 16:16:44] storage: using filesystem storage backend
[12/19/2017 16:16:44] index office pdf: True
[12/19/2017 16:16:44] starting worker0 worker threads for indexing
[12/19/2017 16:16:44] starting worker1 worker threads for indexing
[12/19/2017 16:16:45] worker1 worker updated at 2017-12-19 16:16 time
[12/19/2017 16:16:45] worker0 worker updated at 2017-12-19 16:16 time
[12/19/2017 16:16:45] index updated, total time 1.03446102142 seconds
[12/19/2017 16:16:45] start to clear deleted repo
[12/19/2017 16:16:45] deleted repo has been cleared
[12/19/2017 16:16:45]

Index updated, statistic report:

[12/19/2017 16:16:45] [commit read] 0
[12/19/2017 16:16:45] [dir read]    0
[12/19/2017 16:16:45] [file read]   0
[12/19/2017 16:16:45] [block read]  0

Regards,

Gautier

Hi Gautier,

If you edit the files as I wrote in my original post, the bug should be fixed.

Ciao,

Gideon

We have just uploaded version 6.2.3 to fix the problem. Can you upgrade to this version?

Yes, I did.

If I clear the index, I get an error:

~/seafile/seafile-pro-server-6.2.3$ ./pro/pro.py search --clear
Delete seafile search index ([y]/n)? y

Delete search index, this may take a while...

/usr/bin/python2.7: No module named seafes.update_repos

It’s the same with:

/seafile/seafile-pro-server-6.2.3$ ./pro/pro.py search --update

Updating search index, this may take a while...

/usr/bin/python2.7: No module named seafes.update_repos
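
The `<interpreter>: No module named ...` form is what Python prints when `python -m <module>` cannot find the module, so one way to narrow this down is to try the import directly with the same interpreter and search path. A minimal sketch; `can_import` is a hypothetical helper, and whether `pro.py` launches the module with `-m` is an assumption based on the message format.

```python
import importlib


def can_import(module_name):
    """Return True if module_name can be imported on the current sys.path."""
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False


print(can_import('json'))                 # a standard-library module
print(can_import('seafes.update_repos'))  # depends on the installation
```

If the second call reports `False` in the environment the server uses, the module is either missing from the package or not on the search path.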

I have Python 2.7.9.

What is happening?

Gautier

Thanks for reporting the issue. We have uploaded v6.2.4 to fix the problem.

Hi,

After updating to 6.2.4:

./pro/pro.py search --clear
Delete seafile search index ([y]/n)? y

Delete search index, this may take a while...

[12/20/2017 11:52:35] storage: using filesystem storage backend
[12/20/2017 11:52:35] index office pdf: True
[12/20/2017 11:52:35] deleting index repo_head
[12/20/2017 11:52:35] deleting index repofiles

While indexing:

  • many parsing errors (fonts, etc.), but no significant problem

  • the indexing process completes

    [12/20/2017 11:58:52] Queue is empty, worker1 worker threads stop
    [12/20/2017 11:58:52] worker1 worker updated at 2017-12-20 11:58 time
    [12/20/2017 11:58:52] index updated, total time 318.883892059 seconds
    [12/20/2017 11:58:52] start to clear deleted repo
    [12/20/2017 11:58:52] deleted repo has been cleared
    [12/20/2017 11:58:52]

    Index updated, statistic report:

    [12/20/2017 11:58:52] [commit read] 26
    [12/20/2017 11:58:52] [dir read] 2026
    [12/20/2017 11:58:52] [file read] 1815
    [12/20/2017 11:58:52] [block read] 2664

  • searching inside PDFs works

Thank you @gideon @daniel.pan for the quick debugging!