Auto OCR for PDF and other documents feature

Dropbox recently announced this feature, and it would be mighty useful for those of us dealing with large volumes of documents.

https://www.google.com/amp/s/blogs.dropbox.com/dropbox/2018/10/search-images-text-ocr/amp/

Any chance we could see something like this in the future?

Or any suggestion on how to accomplish this?

Thanks,

@daniel.pan?

Any ideas?

bump for feature <3

We don’t have a plan to implement such a feature yet.

Hello,
I think you should try to implement Tesseract OCR in Seafile.

https://www.supinfo.com/articles/single/10070--tesseract-ocr-python

https://lothiancaleysweb.co.uk/how-to-install-fulltextsearch-in-nextcloud-with-elasticsearch-and-tesseract-ocr

https://lothiancaleysweb.co.uk/how-to-install-fulltextsearch-in-nextcloud-with-elasticsearch-and-tesseract-ocr/2

https://lothiancaleysweb.co.uk/how-to-install-fulltextsearch-in-nextcloud-with-elasticsearch-and-tesseract-ocr/3

This seems to be the same game as the implementation of Elasticsearch.

Would that accomplish what I posted previously, i.e. the Dropbox auto-OCR implementation?

From what I know about Tesseract, my guess is yes.

Following up on this.

As far as I’m aware, this is a tremendously valuable feature of Dropbox Pro, one that could make document management within Seafile much more powerful.

Any plans to implement?

I am not familiar with Dropbox features, and I guess few in this forum are. So it would be helpful to have a detailed description of your request. The link you posted above is dead.

What I can tell you is that the University of Mainz, Germany, has established a process that lets students scan at the public multi-function devices and save the OCRed files directly in Seafile. The English page is not really insightful, to give it a positive spin, but it links to the German page, which provides much more info. If you visit the site with Chrome, just have Chrome do the translation work for you. It should give you an understanding. The article does not mention the OCR part of it, but they did it.

Long story short: It’s totally doable. You don’t have to wait for Seafile to implement it. With a little motivation and ingenuity, your custom process can even account for special circumstances in your organization.

@rdb Thanks for your response.

It’s not a problem when incoming documents are automatically OCR’d by my scanner, but I also have a lot of docs that have not been OCR’d.

Apologies for the dead link; here’s a TechCrunch article from about a year ago describing the same feature: https://techcrunch.com/2018/10/09/dropbox-finally-adds-automatic-ocr-for-all-your-pdfs-and-photos/

I’ve already set up a scan-to-Seafile system using a dedicated library folder for scanning, which is synced with the computer I scan my docs to. As long as the scans are OCR’d, it’s not a problem.

However, there are a lot of docs in my collection that are not OCR’d. My understanding of this Dropbox feature is that it will OCR any PDF files or images in your entire collection.

Any ideas on how to accomplish that? I’ve tried a couple of different third-party software solutions with little or no success.

Any insight appreciated. Thanks

Sorry, I don’t have any info regarding an extension of the sort you are looking for.
Is this for you personally or for a company? If a company, this could be one of those situations where a custom development is then later made available to the community.

Here’s a more recent blog post by Dropbox describing this feature in detail:

I’m still looking to implement something along these lines. If anyone has a suggestion, that’d be great; if I figure anything out, I will also post here.

Thank you,

Here’s what I did:

I started a VM somewhere with seaf-cli syncing all my repositories that could contain PDFs.
Now I am running a cron job that scans all files for PDFs, checks whether each one has a text layer, and if not, calls the ocrmypdf container to process it.

Works like a charm so far.
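
For anyone wanting to try the same thing, here is a rough sketch of what such a cron-driven script might look like. It is not my exact script, just an illustration under some assumptions: the text-layer check shells out to poppler's `pdftotext` (any non-whitespace output is treated as "has text"), the container is assumed to be the public `jbarlow83/ocrmypdf` image, and the function names (`find_pdfs`, `has_text_layer`, `ocr_command`, `process_tree`) are made up for this example.

```python
import subprocess
from pathlib import Path

# Assumption: the public OCRmyPDF Docker image; substitute whatever you use.
OCR_IMAGE = "jbarlow83/ocrmypdf"


def find_pdfs(root):
    """Return every PDF under root (case-insensitive extension match)."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def extract_text(pdf):
    """Extract text with poppler's pdftotext ("-" sends output to stdout)."""
    result = subprocess.run(
        ["pdftotext", str(pdf), "-"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def has_text_layer(pdf, extractor=extract_text):
    """Heuristic: any non-whitespace extracted text counts as a text layer."""
    return bool(extractor(pdf).strip())


def ocr_command(pdf, image=OCR_IMAGE):
    """Build a docker command that OCRs the file in place."""
    pdf = Path(pdf).resolve()
    inside = f"/data/{pdf.name}"  # path as seen from inside the container
    return ["docker", "run", "--rm", "-v", f"{pdf.parent}:/data",
            image, inside, inside]


def process_tree(root, extractor=extract_text, runner=subprocess.run):
    """OCR every PDF under root that lacks a text layer; return those paths."""
    processed = []
    for pdf in find_pdfs(root):
        if has_text_layer(pdf, extractor):
            continue  # already searchable, leave it alone
        runner(ocr_command(pdf), check=False)
        processed.append(pdf)
    return processed
```

A cron entry would then call something like `process_tree("/path/to/synced/libraries")` on the directory that seaf-cli keeps in sync (the path is a placeholder). Note that a scan in which OCR finds no text at all would be picked up again on every run, so you may want to keep a list of already-processed files.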
