Tesseract ocr linux install

Introduction

Tesseract documentation

Tesseract is an open source text recognition (OCR) engine, available under the Apache 2.0 license. It can be used directly from the command line or, for programmers, through an API to extract printed text from images. It supports a wide variety of languages.

Tesseract doesn’t have a built-in GUI, but there are several available from the 3rdParty page.

Installation

There are two parts to install: the engine itself and the training data for a language.

Linux

Tesseract is available directly from many Linux distributions. The package is generally called ‘tesseract’ or ‘tesseract-ocr’; search your distribution’s repositories to find it. For example, you can install Tesseract 4.x and its developer tools on Ubuntu 18.04 (Bionic) by running:
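A minimal sketch of that step, assuming an Ubuntu 18.04 system where apt and sudo are available. The run helper and the DRY_RUN variable are illustrative names, not part of Tesseract: by default the script only prints the commands, and it executes them when DRY_RUN=0 is set.

```shell
# Sketch for Ubuntu 18.04. Prints the commands unless DRY_RUN=0 is set.
# run() and DRY_RUN are illustrative helpers, not part of Tesseract.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

install_tesseract() {
    run sudo apt install tesseract-ocr      # engine and command-line tool
    run sudo apt install libtesseract-dev   # development files
}

install_tesseract
```

Run it once as is to review the commands, then again with DRY_RUN=0 to perform the installation.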

Note for Ubuntu users: if apt cannot find the package, try adding the universe component to your sources.list file.

Packages for over 130 languages and over 35 scripts are also available directly from the Linux distributions. The language packages are called ‘tesseract-ocr-langcode’ and ‘tesseract-ocr-script-scriptcode’, where langcode is a three-letter language code and scriptcode is a four-letter script code.

Examples: tesseract-ocr-eng (English), tesseract-ocr-ara (Arabic), tesseract-ocr-chi-sim (Simplified Chinese), tesseract-ocr-script-latn (Latin Script), tesseract-ocr-script-deva (Devanagari script), etc.
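The naming pattern can be sketched as two small shell helpers. The function names tess_lang_pkg and tess_script_pkg are hypothetical, and the sketch assumes the pattern shown in the examples above: codes are lower case and an underscore in a language code becomes a hyphen in the package name.

```shell
# Hypothetical helpers that build package names from language and script
# codes, following the naming pattern described above.
tess_lang_pkg() {
    # chi_sim -> tesseract-ocr-chi-sim (underscores become hyphens)
    printf 'tesseract-ocr-%s\n' "$(printf '%s' "$1" | tr '_' '-')"
}

tess_script_pkg() {
    # Latn -> tesseract-ocr-script-latn (script codes are lower-cased)
    printf 'tesseract-ocr-script-%s\n' "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
}

tess_lang_pkg eng        # prints tesseract-ocr-eng
tess_lang_pkg chi_sim    # prints tesseract-ocr-chi-sim
tess_script_pkg Latn     # prints tesseract-ocr-script-latn
```

Whether a given package actually exists still has to be checked against your distribution’s repositories.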

On distributions supported by snapd, you can also install the prebuilt Tesseract binaries by running the following command:

The traineddata is currently not shipped with the snap package and must be placed manually to

Tesseract Development Version with LSTM engine and related traineddata

5.00 Alpha

Ubuntu PPA

Debian

AppImage

  1. Download AppImage from releases page
  2. Open your terminal application, if not already open
  3. Browse to the location of the AppImage
  4. Make the AppImage executable:
    chmod a+x tesseract*.AppImage
  5. Run it:
    ./tesseract*.AppImage -l eng page.tif page.txt
  • Debian: ≥ 10
  • Fedora: ≥ 29
  • Ubuntu: ≥ 18.04
  • CentOS ≥ 8
  • openSUSE Tumbleweed

Included traineddata files

  • deu — German
  • eng — English
  • fin — Finnish
  • fra — French
  • osd — Script and orientation
  • por — Portuguese
  • rus — Russian
  • spa — Spanish

Tesseract 4 packages with LSTM engine and related traineddata

Ubuntu

Ubuntu PPA

Debian

There are also 4.1.x packages for other versions of Debian; see https://notesalexp.org/tesseract-ocr/

Raspbian

RHEL/CentOS/Scientific Linux, Fedora, openSUSE packages

For example, to install Tesseract with German language traineddata:

For CentOS 8 run the following as root:

For RHEL 7 run the following as root:

For CentOS 7 run the following as root:

For Scientific Linux 7 run the following as root:

For Fedora 32 run the following as root:

For Fedora 31 run the following as root:

For openSUSE Tumbleweed run the following as root:

For openSUSE Leap 15.0 run the following as root:

FOR EXPERTS ONLY.

If you are experimenting with OCR Engine modes, you will need to manually install language training data beyond what is available in your Linux distribution.

Various types of training data can be found on GitHub. Unpack and copy the .traineddata file into a ‘tessdata’ directory. The exact directory will depend both on the type of training data, and your Linux distribution. Possibilities are /usr/share/tesseract-ocr/tessdata or /usr/share/tessdata or /usr/share/tesseract-ocr/4.00/tessdata .
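Since the directory differs between distributions, a short script can check which of the candidates exists. This is only a sketch: find_tessdata_dir is a hypothetical helper name, and the three paths are the possibilities listed above, not an exhaustive list.

```shell
# Print the first existing directory from a list of candidates.
# find_tessdata_dir is a hypothetical helper, not a Tesseract tool.
find_tessdata_dir() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

find_tessdata_dir \
    /usr/share/tesseract-ocr/tessdata \
    /usr/share/tessdata \
    /usr/share/tesseract-ocr/4.00/tessdata \
    || echo "no tessdata directory found in the usual places" >&2
```

The directory it prints is where the .traineddata file would be copied.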

Windows

Installers for Windows for Tesseract 3.05, Tesseract 4 and development version 5.00 Alpha are available from Tesseract at UB Mannheim. These include the training tools. Both 32-bit and 64-bit installers are available.

An installer for the OLD version 3.02 is available for Windows from our download page. This includes the English training data. If you want to use another language, download the appropriate training data, unpack it using 7-zip, and copy the .traineddata file into the ‘tessdata’ directory, probably C:\Program Files\Tesseract-OCR\tessdata .

To access tesseract-OCR from any location you may have to add the directory where the tesseract-OCR binaries are located to the PATH environment variable, probably C:\Program Files\Tesseract-OCR .

Experts can also get binaries built with Visual Studio from the build artifacts of the Appveyor Continuous Integration.

Cygwin

Released versions of tesseract-ocr from 3.02 onwards are part of Cygwin.

The latest version available is 4.1.0. Please see announcement.

Compilation guide for various platforms

Note: This documentation expects you to be familiar with compiling software on your operating system.

Use the same tools for building tesseract as you used for building leptonica.

Linux

To install Tesseract 4.x you can simply run the following command on your Ubuntu 18.xx bionic:

If you wish to install the Developer Tools which can be used for training, run the following command:

The following instructions are for building on Linux; they can also be applied to other UNIX-like operating systems.

Dependencies

  • A compiler for C and C++: GCC or Clang
  • GNU Autotools: autoconf, automake, libtool
  • pkg-config
  • Leptonica
  • libpng, libjpeg, libtiff
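Before starting a build, it can help to check which of these tools are already on the PATH. The sketch below is an assumption-laden illustration: missing_tools is a hypothetical helper, cc and c++ stand in for whichever of GCC or Clang is installed, and libtoolize is used as the command that libtool typically provides. It does not check for the Leptonica or image libraries.

```shell
# Report which of the given commands cannot be found on PATH.
# missing_tools is a hypothetical helper name.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools cc c++ autoconf automake libtoolize pkg-config)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing build tools:" $missing
else
    echo "All listed build tools were found."
fi
```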

Ubuntu

If they are not already installed, you need the following libraries (Ubuntu 16.04/14.04):

If you plan to install the training tools, you also need the following libraries:

Leptonica

You also need to install Leptonica. Ensure that the development headers for Leptonica are installed before compiling Tesseract.

Tesseract versions and the minimum version of Leptonica required:

Tesseract   Leptonica   Ubuntu
4.00        1.74.2      Ubuntu 18.04
3.05        1.74.0      Must build from source
3.04        1.71        Ubuntu 16.04
3.03        1.70        Ubuntu 14.04
3.02        1.69        Ubuntu 12.04
3.01        1.67

One option is to install the distro’s Leptonica package:

but if you are using an oldish version of Linux, the Leptonica version may be too old, so you will need to build from source.

The sources are at https://github.com/DanBloomberg/leptonica . The instructions for building are given in Leptonica README.

Note that if you build Leptonica from source, you may need to ensure that /usr/local/lib is in your library path. This is a common problem on Linux, and the information on Stack Overflow is very helpful.

Installing Tesseract from Git

Please follow instructions in Compiling–GitInstallation

Install elsewhere / without root

Tesseract can be configured to install anywhere, which makes it possible to install it without root access.

To install it in $HOME/local:

To install it in $HOME/local using Leptonica libraries also installed in $HOME/local:

On some systems, you might also need to specify the path to pkg-config before running the configure script:
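The three cases above can be sketched in one script, to be run from the top of an unpacked Tesseract source tree. The function name build_tesseract_in_home is hypothetical. The LIBLEPT_HEADERSDIR variable and the --with-extra-libraries option are assumptions about what the configure script of your Tesseract version accepts; check the output of ./configure --help before relying on them.

```shell
# Install prefix inside the home directory, so no root access is needed.
prefix="$HOME/local"

# Only needed on some systems: let pkg-config find libraries under $prefix.
export PKG_CONFIG_PATH="$prefix/lib/pkgconfig${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH}"

# Hypothetical helper wrapping the usual Autotools sequence.
# The two Leptonica settings are only needed when Leptonica itself was
# installed under $prefix; otherwise call ./configure with --prefix alone.
build_tesseract_in_home() {
    ./autogen.sh &&
    LIBLEPT_HEADERSDIR="$prefix/include" \
        ./configure --prefix="$prefix" --with-extra-libraries="$prefix/lib" &&
    make &&
    make install
}
```

Calling build_tesseract_in_home then runs the whole sequence and stops at the first step that fails.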

Video representation of the Compiling process for Tesseract 4.0 and Leptonica 1.7.4 on Ubuntu 16.xx

Language Data

  • Download the data file(s) for the language(s) you are interested in.
  • Move it to the tessdata directory (e.g. mv tessdata $TESSDATA_PREFIX, if TESSDATA_PREFIX is defined)

You can also use:

to point to your tessdata directory (for example, if your tessdata path is ‘/usr/local/share/tessdata’, use ‘export TESSDATA_PREFIX=/usr/local/share/’).
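A minimal sketch of that setting, assuming the tessdata directory is /usr/local/share/tessdata as in the example. The tessdata_dir variable is an illustrative name; the value is derived from the path rather than typed by hand so that the trailing slash is not forgotten.

```shell
# Assumed location; adjust to wherever your tessdata directory lives.
tessdata_dir=/usr/local/share/tessdata

# As described above, TESSDATA_PREFIX is the parent of the tessdata
# directory, written with a trailing slash.
TESSDATA_PREFIX="$(dirname "$tessdata_dir")/"
export TESSDATA_PREFIX

echo "$TESSDATA_PREFIX"   # prints /usr/local/share/
```

To make the setting permanent, the export line can go into your shell startup file. Some Tesseract releases may expect the variable to name the tessdata directory itself, so if languages are not found, running tesseract --list-langs is a quick way to check.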

Windows

master branch, 3.05 and later

Using Tesseract

IMPORTANT: To use Tesseract in your application (to include Tesseract or to link it into your app), see this very simple example.

Build the latest library (using Software Network client)

  1. Download the latest SW (Software Network https://software-network.org/ ) client from https://software-network.org/client/ .
  2. Run sw setup (may require administrator access)
  3. Run sw build org.sw.demo.google.tesseract.tesseract-master .

For visual studio project using tesseract

  1. Setup Vcpkg the Visual C++ Package Manager.
  2. Run vcpkg install tesseract:x64-windows for 64-bit. Use --head for the master branch.

Static linking

To build a self-contained tesseract.exe executable (without any DLLs or runtime dependencies), use Vcpkg as above with the following command:

  • vcpkg install tesseract:x64-windows-static for 64-bit
  • vcpkg install tesseract:x86-windows-static for 32-bit

Use --head for the master branch. It may still require one DLL for the OpenMP runtime, vcomp140.dll (which you can find in the Visual C++ Redistributable 2015).

Build training tools

Today it is possible to build a full set of tess training tools on Windows with Visual Studio. The latest versions (Win10, 2019) are preferable.

  1. Download the latest SW (Software Network https://software-network.org/client/ ) client from https://software-network.org/client/ .
  2. Check out the Tesseract sources: git clone https://github.com/tesseract-ocr/tesseract tesseract && cd tesseract .
  3. Run sw build .
  4. Binaries will be available under .sw\out\some hash dir.

Develop Tesseract

To develop Tesseract itself, follow these steps:

  1. Download and install Git, CMake and put them in PATH.
  2. Download the latest SW (Software Network https://software-network.org/ ) client from https://software-network.org/client/ . SW is a source package distribution system.
  3. Add SW client to PATH.
  4. Run sw setup (may require administrator access)
  5. If you have a release archive, unpack it to tesseract dir.

If you’re using master branch run

Build a solution ( tesseract.sln ) in your Visual Studio version. If you want to build and install from command line (e.g. Release build) you can use this command:

If you want to install to a directory other than C:\Program Files (which requires administrator rights), you need to specify the install path during configuration:

To develop the training tools, after cloning the repository as described in the previous paragraph, run

A solution link will appear in the root directory of Tesseract.

Building for x64 platform

If you’re building with sw+cmake, run cmake as follows:

If you’re building with sw run sw generate , it will create a solution link for you (not yet implemented!).

If you have Visual Studio 2015, check out the https://github.com/peirick/VS2015_Tesseract repository, which provides Visual Studio 2015 projects for Tesseract and its dependencies, and click on build_tesseract.bat. After that you still need to download the language packs.

3.03rc-1

Download these packages from the Downloads Archive on SourceForge page:

  • tesseract-3.01.tar.gz — Tesseract source
  • tesseract-3.01-win_vs.zip — Visual studio (2008 & 2010) solution with necessary libraries
  • tesseract-ocr-3.01.eng.tar.gz — English language file for Tesseract (or download other language training file)

Unpack them to one directory (e.g. tesseract-3.01 ). Note that tesseract-ocr-3.01.eng.tar.gz names the root directory ‘tesseract-ocr’ instead of ‘tesseract-3.01’ .

Windows relevant files are located in vs2008 directory (e.g. ‘tesseract-3.01\vs2008’). The same build process as usual applies: Open tesseract.sln with VC++Express 2008 and build all (or just Tesseract.) It should compile (in at least release mode) without having to install anything further. The dll dependencies and Leptonica are included. Output will be in tesseract-3.01\vs2008\bin (or tesseract-3.01\vs2008\bin.rd or tesseract-3.01\vs2008\bin.dbg based on configuration build).

Mingw+Msys

Msys2

Download and install MSYS2 Installer from https://msys2.github.io/

The core packages groups you need to install if you wish to build from PKGBUILDs are:

  • base-devel for any building
  • msys2-devel for building msys2 packages
  • mingw-w64-i686-toolchain for building mingw32 packages
  • mingw-w64-x86_64-toolchain for building mingw64 packages

To build the tesseract-ocr release package, use PKGBUILD from https://github.com/Alexpux/MINGW-packages/tree/master/mingw-w64-tesseract-ocr

Cygwin

To build on Cygwin, see the blog post How to build Tesseract on Cygwin.

Tesseract as well as the training utilities for 3.04.00 onwards are available as Cygwin packages.

Mingw-w64

Mingw-w64 allows building 32- or 64-bit executables for Windows. It can be used for native compilations on Windows, but also for cross compilations on Linux (which are easier and faster than native compilations). Most large Linux distributions already contain packages with the tools needed for a cross build. Before building Tesseract, it is necessary to build some prerequisites.

For Debian and similar distributions (e.g. Ubuntu), the cross tools can be installed as follows:

These prerequisites will be needed:

macOS

Typically a package manager like Fink, Homebrew or MacPorts is needed in addition to Apple’s Xcode. Xcode and the related command line tools provide the compiler ( llvm-gcc ) and linker, as well as libraries like zlib . The package manager provides free software packages which are not part of Xcode.

The Xcode Command Line Tools can be installed by running xcode-select --install .

Note that Tesseract 4 can be built with OpenMP support, but that requires additional installations.

macOS with Fink

Fink (as of 2017-04) neither provides Leptonica nor the packages needed for the Tesseract training tools, so it cannot be recommended for building Tesseract.

macOS with MacPorts

Prepare support for OpenMP (optional)

The following method which gets, compiles and installs OpenMP manually should no longer be needed:
