Tesseract OCR on macOS

Tesseract documentation

Introduction

Tesseract is an open-source text recognition (OCR) engine, available under the Apache 2.0 license. It can be used directly from the command line or, for programmers, through an API to extract printed text from images. It supports a wide variety of languages.

Tesseract doesn’t have a built-in GUI, but there are several available from the 3rdParty page.

Installation

There are two parts to install: the engine itself and the training data for a language.

Linux

Tesseract is available directly from many Linux distributions. The package is generally called ‘tesseract’ or ‘tesseract-ocr’; search your distribution’s repositories to find it. For example, you can install Tesseract 4.x and its developer tools on Ubuntu 18.x bionic by running sudo apt install tesseract-ocr followed by sudo apt install libtesseract-dev.

Note for Ubuntu users: if apt is unable to find the package, try adding the universe entry to the sources.list file.

Packages for over 130 languages and over 35 scripts are also available directly from the Linux distributions. The language packages are called ‘tesseract-ocr-langcode’ and ‘tesseract-ocr-script-scriptcode’, where langcode is a three-letter language code and scriptcode is a four-letter script code.

Examples: tesseract-ocr-eng (English), tesseract-ocr-ara (Arabic), tesseract-ocr-chi-sim (Simplified Chinese), tesseract-ocr-script-latn (Latin Script), tesseract-ocr-script-deva (Devanagari script), etc.
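Because the naming scheme above is mechanical, package names can be derived from the codes. The following is a minimal sketch, not part of Tesseract or apt: the helper names lang_pkg and script_pkg are hypothetical, and the two transformations (underscores become hyphens, script codes are lowercased) are inferred from the examples above.

```shell
# Hypothetical helpers: build distribution package names from Tesseract codes.
lang_pkg() {
    # Underscores in a language code (e.g. chi_sim) become hyphens.
    printf 'tesseract-ocr-%s\n' "$(printf '%s' "$1" | tr '_' '-')"
}

script_pkg() {
    # Four-letter script codes are lowercased (e.g. Latn -> latn).
    printf 'tesseract-ocr-script-%s\n' "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
}

lang_pkg eng
lang_pkg chi_sim
script_pkg Deva
```

The result can then be handed to the package manager, for example sudo apt install "$(lang_pkg deu)" on a Debian-style system.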

For distributions that are supported by snapd, you can also install the prebuilt Tesseract binaries from the snap package.

The traineddata is currently not shipped with the snap package and must be placed manually to

Tesseract Development Version with LSTM engine and related traineddata

5.00 Alpha

Ubuntu PPA

Debian

AppImage

  1. Download AppImage from releases page
  2. Open your terminal application, if not already open
  3. Browse to the location of the AppImage
  4. Make the AppImage executable:
    chmod a+x tesseract*.AppImage
  5. Run it:
    ./tesseract*.AppImage -l eng page.tif page.txt

AppImage compatibility:

  • Debian: ≥ 10
  • Fedora: ≥ 29
  • Ubuntu: ≥ 18.04
  • CentOS: ≥ 8
  • openSUSE Tumbleweed

Included traineddata files

  • deu — German
  • eng — English
  • fin — Finnish
  • fra — French
  • osd — Script and orientation
  • por — Portuguese
  • rus — Russian
  • spa — Spanish

Tesseract 4 packages with LSTM engine and related traineddata

Ubuntu

Ubuntu PPA

Debian

There are also 4.1.x packages for other versions of Debian; see https://notesalexp.org/tesseract-ocr/

Raspbian

RHEL/CentOS/Scientific Linux, Fedora, openSUSE packages

For example to install Tesseract with German language traineddata:

For CentOS 8 run the following as root:

For RHEL 7 run the following as root:

For CentOS 7 run the following as root:

For Scientific Linux 7 run the following as root:

For Fedora 32 run the following as root:

For Fedora 31 run the following as root:

For openSUSE Tumbleweed run the following as root:

For openSUSE Leap 15.0 run the following as root:

FOR EXPERTS ONLY.

If you are experimenting with OCR Engine modes, you will need to manually install language training data beyond what is available in your Linux distribution.

Various types of training data can be found on GitHub. Unpack and copy the .traineddata file into a ‘tessdata’ directory. The exact directory will depend both on the type of training data, and your Linux distribution. Possibilities are /usr/share/tesseract-ocr/tessdata or /usr/share/tessdata or /usr/share/tesseract-ocr/4.00/tessdata .
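Since the tessdata location varies, it can help to check which of the candidate directories actually exists before copying anything. This is a minimal sketch under that assumption; find_tessdata_dir is a hypothetical helper, and the candidate paths are simply the ones listed in the paragraph above.

```shell
# Hypothetical helper: print the first existing directory among the candidates.
find_tessdata_dir() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Candidate locations taken from the paragraph above.
if dest=$(find_tessdata_dir \
        /usr/share/tesseract-ocr/4.00/tessdata \
        /usr/share/tesseract-ocr/tessdata \
        /usr/share/tessdata); then
    echo "Copy .traineddata files into: $dest"
else
    echo "No tessdata directory found in the usual locations" >&2
fi
```

Copying into these system directories typically requires root, for example sudo cp deu.traineddata "$dest"/.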

Windows

Installers for Windows for Tesseract 3.05, Tesseract 4, and the development version 5.00 Alpha are available from Tesseract at UB Mannheim. These include the training tools. Both 32-bit and 64-bit installers are available.

An installer for the OLD version 3.02 is available for Windows from our download page. This includes the English training data. If you want to use another language, download the appropriate training data, unpack it using 7-zip, and copy the .traineddata file into the ‘tessdata’ directory, probably C:\Program Files\Tesseract-OCR\tessdata .

To access tesseract-OCR from any location you may have to add the directory where the tesseract-OCR binaries are located to the Path variables, probably C:\Program Files\Tesseract-OCR .

Experts can also get binaries built with Visual Studio from the build artifacts of the Appveyor Continuous Integration.

Cygwin

Released versions >= 3.02 of tesseract-ocr are part of Cygwin.

The latest version available is 4.1.0. Please see announcement.

Tesseract ocr mac os

This is an open-source macOS-based Objective-C wrapper for the OCR library Tesseract.

You can also use this in Swift, instructions below.

Fork this repo if you want to experiment with it.

The wrapper consists of just the following files

  • SLTesseract.h (Header file)
  • SLTesseract.mm (Implementation file)
  • tessdata/ (Language files for Tesseract)
  • lib/ (Compiled dependencies)
  • include/ (Headers for the dependencies)

For those of you who wish to first test out the OCR capabilities, the included Screenshot-OCR is a demo application to showcase this.

First build the Xcode project included in this repository. This will generate an application through which you can take a screenshot.

In the Xcode log you will find the corresponding text Tesseract detected for this screenshot.

Getting this to work in your own project

Clone this project

Copy over the include , lib , and tessdata folders to your project.

Add these folders to your project in Xcode. Make sure include and lib are added as groups and tessdata is added as a folder reference.

Copy over the files SLTesseract.mm and SLTesseract.h to your code directory.

Verify that the file SLTesseract.mm is added to Targets > Build Phases > Compile Sources . Additionally, verify that all the static libraries are also added to Targets > Build Phases > Link Binary With Libraries . (This process should be done automatically)

You are now ready to use Tesseract in your macOS project. (See Example Usage for code syntax)

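At the top of the file, include the header file SLTesseract.h.

Creating an instance of the class SLTesseract will initialize it; the optional settings below assume the instance is named ocr.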

(optional) ocr.language = @"eng";

(optional) ocr.charWhitelist = @"abcdefghijklmnopqrstuvwxyz";

(optional) ocr.charBlacklist = @"1234567890";

Finally, assuming you already have the image that you wish to perform OCR on in NSImage form, pass it to the SLTesseract instance to recognize it and get the corresponding text.

This library can be easily imported in a Swift project.

Just replicate all the steps above.

When adding the .h and .mm files, you will be prompted by Xcode to add a Bridging Header (if you don’t have one already).

Xcode will generate a file named yourProject-Bridging-Header.h

Add a line to the Bridging Header that imports SLTesseract.h.

Then initialize an SLTesseract instance; the optional settings below assume it is named ocr.

(optional) ocr.language = "eng"

(optional) ocr.charWhitelist = "abcdefghijklmnopqrstuvwxyz"

(optional) ocr.charBlacklist = "1234567890"

Finally perform OCR by doing this:

The libraries below are all included in the lib/ directory.

Additionally libcurl is required. To add libcurl , select your target in Xcode, select Build Phases tab and under Link Binary With Libraries phase click on the + button and type libcurl . Select libcurl.tbd .

My project Tesseract macOS itself is distributed under the MIT license (see LICENSE).

Keep in mind that the main dependency Tesseract is distributed under the Apache 2.0 license.

Open an issue if you want something fixed.

You may reach me at Tesseract-macOS@scott-liu.com to inquire about this project.

tesseract install mac os

I am trying to install Tesseract on my Mac using Homebrew. When I try installing, everything seems to be good but I get the following error/message:

When I try running a tesseract function, I get the following error:

I have ImageMagick installed and the command I used to install tesseract was:

Can anyone please tell me what I can do to get tesseract working? Thank you!

EDIT When I run brew link leptonica, I get the following error:

Now, as of September 2019, there are no longer any optional install flags available

yield nothing. But,

yields the following key info:

Therefore, to get all of the languages installed, you need to now install a separate library called tesseract-lang .
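Once tesseract-lang is installed, one way to confirm that the languages you need are present is to compare them against the output of tesseract --list-langs. This is a minimal sketch: missing_langs is a hypothetical helper, the language codes in the example call are placeholders, and it assumes the command prints a header line followed by one language code per line.

```shell
# Hypothetical helper: read `tesseract --list-langs` output on stdin and
# print each requested language code that is not listed.
missing_langs() {
    available=$(cat)
    for lang in "$@"; do
        # -x matches whole lines only, so the header line never matches.
        printf '%s\n' "$available" | grep -qxF -- "$lang" || printf '%s\n' "$lang"
    done
}

# Only query Tesseract if it is actually installed.
if command -v tesseract >/dev/null 2>&1; then
    # Some versions print the list to stderr, so merge both streams.
    tesseract --list-langs 2>&1 | missing_langs eng deu fra
fi
```

An empty result means every requested language was found; any code that is printed still needs its traineddata file.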

Hope this helps.

Older answer, in case it is still useful:

As of January 2019, Tesseract installs fine via Homebrew, as long as you have XQuartz installed first: brew cask install xquartz. Then you can run: brew install tesseract --with-all-languages --with-serial-num-pack --with-training-tools

After installing, removing, and re-installing tesseract, I found the solution to the same problem you have. In your terminal logs, while installing tesseract, you will see:

Error: The brew link step did not complete successfully

The formula built, but is not symlinked into /usr/local
Could not symlink bin/convertfilestopdf
Target /usr/local/bin/convertfilestopdf already exists.
You may want to remove it: rm '/usr/local/bin/convertfilestopdf'
To force the link and overwrite all conflicting files: brew link --overwrite leptonica

To list all files that would be deleted: brew link --overwrite --dry-run leptonica

What I did was run brew link --overwrite leptonica, which printed:

Linking /usr/local/Cellar/leptonica/1.71_1... 45 symlinks created

Installing Tesseract on a Mac (OSX 10.8)

Despite finding several pages with instructions on how to install Tesseract, I found that I had to cobble together my own set of instructions using bits and pieces of information I gathered from all of them.

UPDATED, May 2015: With the assistance of many fantastic participants in the various OCR workshops we’ve held over the last year, these instructions have been updated. The following is what has worked best and most consistently for most people.

Please reference our handy UNIX command cheat sheet for some extra help with the Terminal commands.

Tesseract Setup:

MacPorts:

MacPorts is an open-source software package management tool that makes it relatively easy for Mac users to compile, install and upgrade open-source software and their dependencies. It’s a great first step in installing Tesseract on a Mac.

  1. It will be helpful during this install process to be able to see your hidden files (those files and folders that start with a "." and normally aren’t displayed in the Finder or Terminal).
    1. Open a Terminal window
    2. Enter: defaults write com.apple.finder AppleShowAllFiles YES
    3. Close and reopen any Finder or Terminal windows.

    Install Xcode from the App Store, or from the Mac Developer website if you need an older version.

    Xcode is a Mac Developer application. The version in the App Store (6.3.1) is only for Mac OSX Yosemite 10.10, or later. If you have an older version of the Mac OS then you’ll need to create a Mac Developer ID at the link above and then find the appropriate version of Xcode for your OS:

    • OSX Mavericks 10.9: Xcode 6.2
    • OSX Mountain Lion 10.8: Xcode 5.1.1
    • Earlier versions are also available.

    Be sure to install the full Xcode package («Xcode 6.2») rather than any of the smaller components like command line tools, etc.

    You’ll need to accept the Xcode license agreement before you can use it or do some of the following steps:

    1. Open your Applications folder and find the new Xcode app
    2. Open Xcode.
    3. Accept the license agreement.
    4. Close Xcode.
  2. Install code and dependencies for Tesseract:
    1. sudo port install autoconf
    2. sudo port install automake
    3. sudo port install libtool
    4. sudo port install jpeg tiff libpng
    5. sudo port install leptonica
  3. Finally, make sure everything is up to date and properly installed: sudo port selfupdate

Installing Tesseract:

There are a couple of options at this point. Using MacPorts is the easiest and fastest way to install Tesseract. This will install the latest "released" version of Tesseract, which is version 3.02.02. That version works fine, but does not include the code that writes the confidence level of each word (x_wconf) to the hOCR output files. The x_wconf values are necessary for eMOP post-processing algorithms to work. If you want to use eMOP’s hOCR Denoising and/or eMOP’s Page Corrector, you will need to install Tesseract version 3.03 from source using SVN.

with MacPorts: [3.02.02]

  1. sudo port install tesseract
  2. You can also install Tesseract’s default English language training set (or any other language training set already available) by running sudo port install tesseract-eng

from Source (SVN): [3.03]

These instructions will install Tesseract in a folder called tesseract-ocr/ in your home folder (/Users/[your-username]/, or "~" or "$HOME" for short).

svn checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr
This will download the Tesseract files into a folder called tesseract-ocr in your home directory

Warning: If the autogen shell script fails due to aclocal, you can fix it by adding the autotools bin directory to your $PATH system variable:
PATH=$PATH:~/devtools/autotools-bin/bin/; export PATH

If configure is successful you will see something like:

Configuration is done.
You can now build and install tesseract by running:
.

If not, then you can scroll up to see where your failure is occurring.

Warning: If configure fails because it can’t find leptonica, then you can create a symlink that will tell the system where leptonica has been installed.
ln -s /opt/local/include /usr/local/include

sudo make install

Test to see if Tesseract installed properly by typing tesseract .

Warning: If the command cannot be found, then you need to move the tesseract executable into a folder that’s part of the PATH system variable.
copy ./api/tesseract and ./api/.libs to /opt/local/bin/

NOTE: If you read the Tesseract install instructions or paid close attention to the messages displayed with the above steps you will have seen mention of making install-langs. I have not been able to get the «make install-langs» command to work for quite some time. But it’s not really something to be concerned about. All that command does is download and install language (i.e. typeface with language-specific dictionary) training from the Google website and install it in the tessdata/ folder in tesseract-ocr/. We can do the same thing by hand by downloading any language training from various websites (Google Code or eMOP Github for example) and putting it in the tessdata/ folder as needed.

Check your permissions
Some users may need to change the permissions of the downloaded .traineddata files in the tessdata/ folder in order to use them.

In the ~/tesseract-ocr/tessdata folder:

  • ls -l shows the permissions for all files in the folder.
  • If your .traineddata file has something like -rw-r----- to the left of it, then other users cannot read it.
  • sudo chmod 777 *.traineddata will give every user and every app permission to do anything with all the .traineddata files in the folder, which fixes any permissions problems you might have. Read access is all Tesseract needs, so the narrower sudo chmod 644 *.traineddata is also sufficient.
  • TESSDATA_PREFIX

    Finally, you have to set the $TESSDATA_PREFIX system variable so that the Tesseract command knows where to find the tessdata/ folder that contains the files it needs to run on the language training you create. Any Tesseract training that you create or download will include a .traineddata file which must be present in the tessdata/ folder, and the parent folder of tessdata/ must be identified by the $TESSDATA_PREFIX system variable.

      To see the value of the $TESSDATA_PREFIX in your current Terminal session:
      echo $TESSDATA_PREFIX
      It should be blank at this point.

    To set the value of the $TESSDATA_PREFIX in your current Terminal session:
    export TESSDATA_PREFIX="/Users/[your-username]/tesseract-ocr" , or
    export TESSDATA_PREFIX="$HOME/tesseract-ocr"

    NOTE: DO NOT use the '~' character as a shortcut to your home directory in the TESSDATA_PREFIX. It just doesn’t work. Use the whole filepath.

  • Setting the TESSDATA_PREFIX with the export command will only set the system variable for this session of your terminal. To make this a permanent assignment that will be applied every time you open a new terminal window, you can add the above export command to the .profile file in your home directory.
    1. Open your Finder, and go to your home directory (/Users/[your-username]/).
    2. Find the .profile file (which will be visible, but gray if you did step #1 above), and double-click.
    3. It should open in your default text editor. If not, then select a text editor to open the file with.
    4. Add the above export command to the end of the file and Save.
    5. Open another Terminal window and enter echo $TESSDATA_PREFIX . You should see the correct file path now.
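The permission and TESSDATA_PREFIX steps above can also be scripted. The following is a minimal sketch rather than an official procedure: setup_tessdata is a hypothetical helper, it uses chmod 644 (read access for everyone) instead of the broader 777, and it takes the parent folder and the profile file as arguments so that it can be tried on a scratch file first.

```shell
# Hypothetical helper: make traineddata readable and persist TESSDATA_PREFIX.
#   $1 = parent folder of tessdata/ (default: $HOME/tesseract-ocr)
#   $2 = profile file to update     (default: $HOME/.profile)
setup_tessdata() {
    prefix=${1:-"$HOME/tesseract-ocr"}
    profile=${2:-"$HOME/.profile"}
    line="export TESSDATA_PREFIX=\"$prefix\""

    # Read access for all users is enough; 777 is not required.
    for f in "$prefix"/tessdata/*.traineddata; do
        [ -e "$f" ] && chmod 644 "$f"
    done

    # Append the export line only if it is not already in the profile.
    if ! grep -qxF -- "$line" "$profile" 2>/dev/null; then
        printf '%s\n' "$line" >> "$profile"
    fi

    # Also set it for the current session.
    export TESSDATA_PREFIX="$prefix"
}
```

Calling setup_tessdata with no arguments applies the defaults described in the steps above; running it twice does not duplicate the export line. If the .traineddata files are owned by another user, the chmod step needs sudo.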