Tesseract Open Source OCR Engine (main repository)

Tesseract OCR

About

This package contains an OCR engine - libtesseract and a command line program - tesseract. Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused on line recognition, but also still supports the legacy Tesseract OCR engine of Tesseract 3 which works by recognizing character patterns. Compatibility with Tesseract 3 is enabled by using the Legacy OCR Engine mode (--oem 0). It also needs traineddata files which support the legacy engine, for example those from the tessdata repository.

The lead developer is Ray Smith. The maintainer is Zdenko Podobny. For a list of contributors see AUTHORS and GitHub's log of contributors.

Tesseract has Unicode (UTF-8) support and can recognize more than 100 languages "out of the box".

Tesseract supports various output formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, TSV. The master branch also has experimental support for ALTO (XML) output.
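
On the command line, an output format is selected by naming a config file after the output base. The following minimal sketch shows one way to turn the formats listed above into trailing arguments. The helper name `output_args` and the dictionary keys are assumptions made for this illustration. The config names (`txt`, `hocr`, `pdf`, `tsv`, `alto`) and the `textonly_pdf` variable are assumed to match the configs shipped with Tesseract 4.x, so check them against your installation.

```python
# Config names passed to the tesseract command line for each output format.
# The dictionary keys are labels chosen for this sketch only.
OUTPUT_CONFIGS = {
    "text": "txt",
    "hocr": "hocr",
    "pdf": "pdf",
    "tsv": "tsv",
    "alto": "alto",  # described as experimental in this README
}


def output_args(formats, text_only_pdf=False):
    """Return the trailing command-line arguments that select output formats."""
    formats = list(formats)
    args = []
    if text_only_pdf:
        # textonly_pdf=1 asks the PDF renderer to leave out the page image.
        args += ["-c", "textonly_pdf=1"]
        if "pdf" not in formats:
            formats.append("pdf")
    for name in formats:
        args.append(OUTPUT_CONFIGS[name])
    return args


print(output_args(["text", "hocr"]))
print(output_args([], text_only_pdf=True))
```

The returned list is meant to be appended after the image name and output base of a `tesseract` invocation.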

Note that in many cases you will need to improve the quality of the image you give Tesseract in order to get better OCR results.

This project does not include a GUI application. If you need one, please see the 3rdParty documentation.

Tesseract can be trained to recognize other languages. See Tesseract Training for more information.

Brief history

Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley, Colorado, between 1985 and 1994, with some more changes made in 1996 to port it to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. Since 2006 it has been developed by Google.

The latest (LSTM based) stable version is 4.1.1, released on December 26, 2019. Latest source code is available from master branch on GitHub. Open issues can be found in issue tracker, and planning documentation.

The latest 3.0x version is 3.05.02, released on June 19, 2018. Latest source code for 3.05 is available from 3.05 branch on GitHub. There is no development for this version, but it can be used for special cases (e.g. see Regression of features from 3.0x).

See Release Notes and Change Log for more details of the releases.

Installing Tesseract

You can either install Tesseract via a pre-built binary package or build it from source.

Supported compilers are:

  • GCC 4.8 and above
  • Clang 3.4 and above
  • MSVC 2015, 2017, 2019

Other compilers might work, but are not officially supported.

Running Tesseract

Basic command line usage:

tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles...]

For more information about the various command line options use tesseract --help or man tesseract.

Examples can be found in the documentation.
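
As an illustration of the usage line above, here is a minimal Python sketch that assembles the argument list in the documented order. The function name `build_command` and the file names are hypothetical. Nothing is executed here, so the sketch does not require Tesseract to be installed.

```python
import shlex


def build_command(image, outputbase, lang=None, oem=None, psm=None, configfiles=()):
    """Assemble: tesseract imagename outputbase [-l lang] [--oem N] [--psm N] [configfiles...]"""
    cmd = ["tesseract", image, outputbase]
    if lang is not None:
        cmd += ["-l", lang]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    # Config files always come last on the command line.
    cmd += list(configfiles)
    return cmd


# Hypothetical file names; --oem 0 selects the legacy engine mentioned in "About".
cmd = build_command("scan.png", "scan", lang="eng", oem=0, psm=3, configfiles=["hocr"])
print(" ".join(shlex.quote(part) for part in cmd))
```

To actually run the command, the list can be passed to `subprocess.run(cmd, check=True)`, assuming `tesseract` is on the PATH.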

For developers

Developers can use libtesseract C or C++ API to build their own application. If you need bindings to libtesseract for other programming languages, please see the wrapper section in the AddOns documentation.

Documentation of Tesseract generated from source code by doxygen can be found on tesseract-ocr.github.io.

Support

Before you submit an issue, please review the guidelines for this repository.

For support, first read the documentation, particularly the FAQ to see if your problem is addressed there. If not, search the Tesseract user forum, the Tesseract developer forum and past issues, and if you still can't find what you need, ask for support in the mailing-lists.

Mailing-lists:

Please report an issue only for a bug, not for asking questions.

License

The code in this repository is licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

NOTE: This software depends on other packages that may be licensed under different open source licenses.

Tesseract uses Leptonica library which essentially uses a BSD 2-clause license.

Dependencies

Tesseract uses the Leptonica library to open input images; it does not read document formats such as PDF. It is suggested to use Leptonica built with support for zlib, PNG and TIFF (for multipage TIFF).

Latest Version of README

For the latest online version of the README.md see:

https://github.com/tesseract-ocr/tesseract/blob/master/README.md

Comments
  • RFC: Tesseract 4.0.0 – open tasks

    I'd like to collect open tasks which should be addressed before tagging the official release 4.0.0.

    These tasks are on my own list and to be discussed whether we consider them important for the new release or not:

    • Remove deprecated code. This does not include OpenCL or the old Tesseract engine.
    • Add --version parameter for all command line commands.
    • Enhance --list-langs to show additional information for scripts and languages like legacy / LSTM, version. This will make the command slower, because each file must be opened and parsed.
    • Add option to optionally select implementation for dot product (CPU, SSE, AVX, ...).
    • Relative includes for traineddata: tessedit_load_sublangs should search for the sublangs relative to the parent, not starting in tessdata dir.
    • Maybe more fixes for compiler warnings and issues reported by Coverity Scan.
    • (list still incomplete)
  • Build Tesseract from source with Visual Studio

    Environment

    • Tesseract Version: 5.0.0 alpha
    • Commit Number: a1a177f
    • Platform: Windows 10 64 bit

    Current Behavior:

    I cannot build from source. I downloaded the SW client, saved it at "D:\Essam\Software\SW" and added it to the Path. I can run SW from the command line and see the SW information as follows:

    D:\Tutorial\Git\tesseract\build>sw --version
    sw.client.sw version 1.0.0
    git revision 083bb99144549c1f361298e8284daa6b54422965
    assembled on 30.01.2020 18:36:29 Egypt Standard Time

    Then I ran the following commands to compile from source, as described at https://github.com/tesseract-ocr/tesseract/wiki/Compiling:

    git clone https://github.com/tesseract-ocr/tesseract tesseract
    cd tesseract
    mkdir build && cd build
    cmake .. -G "Visual Studio 15 2017 Win64" -DCMAKE_INSTALL_PREFIX=inst

    I received the following error:

    "-- Selecting Windows SDK version 10.0.17763.0 to target Windows 10.0.18363.
    Configuring tesseract version 5.0.0-alpha-621-ga1a17...
    -- target changed from "auto" to "kaby-lake"
    CMake Error at CMakeLists.txt:197 (find_package):
    By not providing "FindSW.cmake" in CMAKE_MODULE_PATH this project has asked CMake to find a package configuration file provided by "SW", but CMake did not find one.

    Could not find a package configuration file provided by "SW" with any of the following names:

    SWConfig.cmake
    sw-config.cmake
    

    Add the installation prefix of "SW" to CMAKE_PREFIX_PATH or set "SW_DIR" to a directory containing one of the above files. If "SW" provides a separate development package or SDK, be sure it has been installed.

    -- Configuring incomplete, errors occurred! See also "D:/Tutorial/Git/tesseract/build/CMakeFiles/CMakeOutput.log"."

    the log file attached

    CMakeOutput.log

    Expected Behavior:

    build tesseract solution

    Suggested Fix:

  • Tag a new version for LSTM 4.0

    Many fixes have been made to master branch for 4.0 since the 4.00.00alpha release in November 2016. A number of assertions have been fixed.

    @zdenop Please add a new tag eg. 4.0.0alpha-1 / 2 (numbering as you consider appropriate). Thanks!

  • RFC: Remove the legacy OCR Engine

    Ray wants to get rid of the legacy OCR engine, so that the final 4.00 version will only have one OCR engine based on LSTM.

    From #518:

    @stweil commented:

    I strongly vote against removing non-LSTM as we currently still get better results with it in some cases.

    @theraysmith commented:

    Please provide examples of where you get better results with the old engine. Right now I'm trying to work on getting rid of redundant code, rather than spending time fighting needless changes that generate a lot of work. I have recently tested an LSTM-based OSD, and it works a lot better than the old, so that is one more use of the old classifier that can go. AFAICT, apart from the equation detector, the old classifier is now redundant.

  • good accuracy but too slow, how to improve Tesseract speed

    I integrated Tesseract C/C++, version 3.x, to read English text from images.

    It works pretty well, but it is very slow. It takes close to 1000 ms (1 second) to read the attached image (00060.jpg) on my quad-core laptop.

    I'm not using the Cube engine, and I'm feeding only binary images to the OCR reader.

    Is there any way to make Tesseract read faster? Thanks.

  • Tesseract 4.0.0 crashed on Intel I5-8400 CPU with Debian 9.6.0 amd64 (SSE/AVX/AVX2)

    Environment

    • Tesseract Version: 4.0.0 Release
    • Commit Number: 51316994ccae0b48692d547030f26c0969308214
    • Platform: Debian 9.6.0 amd64

    Current Behavior: Tesseract 4.0.0 crashed on an Intel I5-8400 CPU with Debian 9.6.0 amd64 (SSE/AVX/AVX2).

    I compiled Tesseract 4.0 on an Intel I5-8400 CPU with Debian 9.6.0 amd64. tesseract --version outputs this:

    tesseract 4.0.0
     leptonica-1.74.2
      libjpeg 6b (libjpeg-turbo 1.5.1) : libpng 1.6.28 : libtiff 4.0.8 : zlib 1.2.8
     Found AVX2
     Found AVX
     Found SSE

    When I call tesseract several times, it crashes and the PC reboots.

    I also have an Intel G4650 CPU, which does not support AVX2 / AVX, and everything works fine there; no crash ever happens. How can I make tesseract work on the Intel I5-8400 with AVX/AVX2/SSE?
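
When comparing builds on different machines, as in the report above, it can help to extract the version and the detected SIMD features from the `tesseract --version` text. The following is a minimal sketch, assuming the output has the shape quoted in this report. The function name `parse_version_output` is hypothetical, and the text is parsed from a string rather than obtained by running Tesseract.

```python
import re


def parse_version_output(text):
    """Extract version, Leptonica version and 'Found ...' features from --version text."""
    version = re.search(r"tesseract\s+(\S+)", text)
    leptonica = re.search(r"leptonica-(\S+)", text)
    return {
        "tesseract": version.group(1) if version else None,
        "leptonica": leptonica.group(1) if leptonica else None,
        # Each "Found X" token names a CPU feature detected at run time.
        "features": re.findall(r"Found\s+(\S+)", text),
    }


reported = (
    "tesseract 4.0.0 leptonica-1.74.2 libjpeg 6b (libjpeg-turbo 1.5.1) : "
    "libpng 1.6.28 : libtiff 4.0.8 : zlib 1.2.8 Found AVX2 Found AVX Found SSE"
)
print(parse_version_output(reported))
```

Comparing the `features` lists of two machines shows which vector code paths differ between them. It does not by itself explain a crash.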

    Expected Behavior:

    Suggested Fix:

  • RFC: Add initial support for traineddata files in compressed archive formats (don't merge)

    This requires libminizip-dev, so expect failures from CI.

    Up to now, little endian tesseract works with the new zip format.

    More work is needed for training tools and big endian support and also to maintain compatibility with the current proprietary format.

    Signed-off-by: Stefan Weil

  • trying to add tessedit_char_whitelist etc. again:

    • ignore matrix outputs in ComputeTopN if they belong to a disabled unichar_id
    • pass UNICHARSET refs to check that
    • in SetBlackAndWhitelist, also update the unicharset of the lstm_recognizer_ instance, if any
  • RFC: Reorganize source tree

    I'd like to propose changes to the Tesseract source tree structure. A common layout today is a src folder with all program sources and an include folder with the public headers. Currently we have a lot of directories in the root, which is very annoying. As a first stage I propose:

    1. move all sources into src
    2. move training tools from training to tools/training

    Later we can try to move public headers to include directory.

    The new look will be like: pic

    If there are no objections, I'll commit changes.

  • 4.0 bugs on MAC OS X and a step by step for reference

    This is the step-by-step procedure I used to install Tesseract 4.0 on Mac OS X, including the fixes and workarounds I needed to make it work. I'm sharing this "guide" to help other people who may run into the same problems.

    Special thanks to Shree, who helped me on the Google group.

    Project and more details: https://github.com/tesseract-ocr/tesseract

    Where to get help?

    Google group: https://groups.google.com/forum/#!forum/tesseract-ocr
    GitHub issues: https://github.com/tesseract-ocr/tesseract/issues

    Platform: Mac OS X 10.13.3
    Tesseract: 4.0.0-beta.1-69-g10f4
    leptonica-1.75.3
    libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11

    Found AVX2
    Found AVX
    Found SSE

    Compiling Tesseract - tesseract 4.0

    Reference: https://github.com/tesseract-ocr/tesseract/wiki/Compiling#macos

    Warning: Don't install tesseract using brew, since you can't generate the ScrollView.jar from it! (At least I wasn't able to generate it)

    Steps

    1 - Install these libs

    brew install automake autoconf autoconf-archive libtool
    brew install pkgconfig
    brew install icu4c
    brew install leptonica
    brew install gcc
    

    2 - Run the code

    ln -hfs /usr/local/Cellar/icu4c/60.2 /usr/local/opt/icu4c
    

    Note: text2image is set to use icu4c/60.2, but the actual version is icu4c/61.1.

    3 - Clone tesseract repo

    git clone https://github.com/tesseract-ocr/tesseract/
    

    4 - Enter in the folder

    cd tesseract
    

    5 - Run the script

    ./autogen.sh
    

    6 - Run the following command and copy the CPPFLAGS and LDFLAGS it reports

    brew info icu4c
    

    7 - Update the CPPFLAGS and LDFLAGS and execute the code

    ./configure \
      CPPFLAGS=-I/usr/local/opt/icu4c/include \
      LDFLAGS=-L/usr/local/opt/icu4c/lib
    

    8 - Build

    make -j
    

    9 - Install

    sudo make install
    

    10 - Update the dyld shared cache

    sudo update_dyld_shared_cache
    

    Note: this is the Mac OS X equivalent of sudo ldconfig.

    11 - Build the training tools

    make training
    

    Creating ScrollView.jar - tesseract 4.0

    Reference:
    https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#lstmtraining-command-line
    https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging

    Important: Use JDK 8 to build; otherwise the build returns an error.

    Steps

    1 - Download the files piccolo2d-core-3.0.jar and piccolo2d-extras-3.0.jar

    http://search.maven.org/remotecontent?filepath=org/piccolo2d/piccolo2d-core/3.0/piccolo2d-core-3.0.jar
    http://search.maven.org/remotecontent?filepath=org/piccolo2d/piccolo2d-extras/3.0/piccolo2d-extras-3.0.jar

    2 - Move the files piccolo2d-core-3.0.jar and piccolo2d-extras-3.0.jar to tesseract/java

    3 - Enter the tesseract/java folder

    cd java
    

    4 - Set the var SCROLLVIEW_PATH to your tesseract/java folder and run the code

    SCROLLVIEW_PATH=~/projects/tesseract/java make ScrollView.jar
    

    Training Font - tesseract 4.0

    Reference: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#user-content-using-tesstrain

    Steps

    1 - Clone the langdata dir from git

    git clone https://github.com/tesseract-ocr/langdata
    

    2 - Enter the tesseract folder

    cd ..
    

    3 - Execute this code and select one font from the list (I recommend "Verdana")

    text2image --list_available_fonts --fonts_dir=/Library/Fonts
    

    The font directory on Mac OS X can be one of:
    ~/Library/Fonts
    /Library/Fonts/
    /Network/Library/Fonts/
    /System/Library/Fonts/
    /System Folder/Fonts/

    More details here: https://support.apple.com/en-us/HT201722

    4 - Replace line 195 of tesseract/training/tesstrain_utils.sh as follows

    - export FONT_CONFIG_CACHE=$(mktemp -d --tmpdir font_tmp.XXXXXXXXXX)
    + export FONT_CONFIG_CACHE=$(mktemp -d -t font_tmp.XXXXXXXXXX)
    

    Note: this fixes the following error:

    mktemp: illegal option -- -
    usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
           mktemp [-d] [-q] [-u] -t prefix
    /Users/username/projects/tesseract/training/tesstrain_utils.sh: line 197: /sample_text.txt: Permission denied
    

    5 - Clone the tessdata repo from git (I recommend "tessdata_best" since it is the most accurate; "tessdata_fast" is simply faster)

    git clone https://github.com/tesseract-ocr/tessdata_best
    

    or

    git clone https://github.com/tesseract-ocr/tessdata_fast
    

    6 - Copy tessdata_best/eng.traineddata (for English training) from the tessdata repository you just cloned and paste it into tesseract/tessdata/

    7 - Create the training data

    PANGOCAIRO_BACKEND=fc \
    ~/projects/tesseract/training/tesstrain.sh \
      --fonts_dir /Library/Fonts \
      --lang eng \
      --linedata_only \
      --noextract_font_properties \
      --exposures "0"    \
      --langdata_dir ~/projects/langdata \
      --tessdata_dir ~/projects/tesseract/tessdata \
      --fontlist "Verdana" \
      --output_dir ~/tesstutorial/engtrain
    

    Add the prefix PANGOCAIRO_BACKEND=fc if using MAC OSX

    8 - Create a second data set using another font, for comparison

    PANGOCAIRO_BACKEND=fc \
    ~/projects/tesseract/training/tesstrain.sh \
      --fonts_dir /Library/Fonts \
      --lang eng \
      --linedata_only \
      --noextract_font_properties \
      --exposures "0"    \
      --langdata_dir ~/projects/langdata \
      --tessdata_dir ~/projects/tesseract/tessdata \
      --fontlist "Times New Roman," \
      --output_dir ~/tesstutorial/engeval
    

    Add the prefix PANGOCAIRO_BACKEND=fc if using MAC OSX

    9 - Create the needed folder

    mkdir -p ~/tesstutorial/engoutput
    

    10 - Start the training

    SCROLLVIEW_PATH=~/projects/tesseract/java \
    ~/projects/tesseract/training/lstmtraining \
    --debug_interval 100 \
    --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
    --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
    --model_output ~/tesstutorial/engoutput/base \
    --learning_rate 20e-4 \
    --train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
    --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
    --max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log
    

    If you failed to build ScrollView.jar, set the debug interval to -1: --debug_interval -1

    11 - Monitor the log on another console

    tail -f ~/tesstutorial/engoutput/basetrain.log
    

    12 - Test Accuracy with other font

    ~/projects/tesseract/training/lstmeval \
      --model ~/tesstutorial/engoutput/base_checkpoint \
      --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
      --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
    

    13 - Test Accuracy with best traindata

    ~/projects/tesseract/training/lstmeval \
      --model ~/projects/tessdata_best/eng.traineddata \
      --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
    

    14 - Test Accuracy with actual traindata (in this case the same as step 13)

    ~/projects/tesseract/training/lstmeval \
      --model ~/projects/tesseract/tessdata/eng.traineddata \
      --eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt
    

    Fine tuning - tesseract 4.0

    Reference: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact

    Steps

    1 - Create the necessary folder

    mkdir -p ~/tesstutorial/verdana_from_small
    

    2 - Start fine tuning

    ~/projects/tesseract/training/lstmtraining \
      --model_output ~/tesstutorial/verdana_from_small/verdana \
      --continue_from ~/tesstutorial/engoutput/base_checkpoint \
      --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
      --train_listfile ~/tesstutorial/engeval/eng.training_files.txt \
      --max_iterations 1200
    

    3 - Validate the progress

    ~/projects/tesseract/training/lstmeval \
      --model ~/tesstutorial/verdana_from_small/verdana_checkpoint \
      --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
      --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
    

    4 - Create the necessary folder

    mkdir -p ~/tesstutorial/verdana_from_full
    

    5 - Combine the trained data

    ~/projects/tesseract/training/combine_tessdata \
      -e ~/projects/tesseract/tessdata/eng.traineddata \
      ~/tesstutorial/verdana_from_full/eng.lstm
    

    6 - Train merged data

    ~/projects/tesseract/training/lstmtraining \
      --model_output ~/tesstutorial/verdana_from_full/verdana \
      --continue_from ~/tesstutorial/verdana_from_full/eng.lstm \
      --traineddata ~/projects/tesseract/tessdata/eng.traineddata \
      --train_listfile ~/tesstutorial/engeval/eng.training_files.txt \
      --max_iterations 400
    

    7 - Validate the results on the main training file

    ~/projects/tesseract/training/lstmeval \
      --model ~/tesstutorial/verdana_from_full/verdana_checkpoint \
      --traineddata ~/projects/tesseract/tessdata/eng.traineddata \
      --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
    

    8 - Validate the results on our training file

    ~/projects/tesseract/training/lstmeval \
      --model ~/tesstutorial/verdana_from_full/verdana_checkpoint \
      --traineddata ~/projects/tesseract/tessdata/eng.traineddata \
      --eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt
    

    Fine tuning add ± character - tesseract 4.0

    Reference: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters

    Steps

    1 - Modify langdata/eng/eng.training_text and include these lines:

    alkoxy of LEAVES ±1.84% by Buying curved RESISTANCE MARKED Your (Vol. SPANIEL
    TRAVELED ±85¢ , reliable Events THOUSANDS TRADITIONS. ANTI-US Bedroom Leadership
    Inc. with DESIGNS self; ball changed. MANHATTAN Harvey's ±1.31 POPSET Os—C(11)
    VOLVO abdomen, ±65°C, AEROMEXICO SUMMONER = (1961) About WASHING Missouri
    PATENTSCOPE® # © HOME SECOND HAI Business most COLETTI, ±14¢ Flujo Gilbert
    Dresdner Yesterday's Dilated SYSTEMS Your FOUR ±90° Gogol PARTIALLY BOARDS firm
    Email ACTUAL QUEENSLAND Carl's Unruly ±8.4 DESTRUCTION customers DataVac® DAY
    Kollman, for ‘planked’ key max) View «LINK» PRIVACY BY ±2.96% Ask! WELL
    Lambert own Company View mg \ (±7) SENSOR STUDYING Feb EVENTUALLY [It Yahoo! Tv
    United by #DEFINE Rebel PERFORMED ±500Gb Oliver Forums Many | ©2003-2008 Used OF
    Avoidance Moosejaw pm* ±18 note: PROBE Jailbroken RAISE Fountains Write Goods (±6)
    Oberflachen source.” CULTURED CUTTING Home 06-13-2008, § ±44.01189673355 €
    netting Bookmark of WE MORE) STRENGTH IDENTICAL ±2? activity PROPERTY MAINTAINED
    

    2 - Generate the training file

    PANGOCAIRO_BACKEND=fc \
    ~/projects/tesseract/training/tesstrain.sh \
      --fonts_dir /Library/Fonts \
      --lang eng \
      --linedata_only \
      --noextract_font_properties \
      --langdata_dir ~/projects/langdata \
      --tessdata_dir ~/projects/tesseract/tessdata \
      --fontlist "Times New Roman," \
                  "Times New Roman, Bold" \
                  "Times New Roman, Bold Italic" \
                  "Times New Roman, Italic" \
                  "Courier New" \
                  "Courier New Bold" \
                  "Courier New Bold Italic" \
                  "Courier New Italic" \
      --output_dir ~/tesstutorial/trainplusminus
    

    3 - Generate the eval data

    PANGOCAIRO_BACKEND=fc \
    ~/projects/tesseract/training/tesstrain.sh \
      --fonts_dir /Library/Fonts \
      --lang eng \
      --linedata_only \
      --noextract_font_properties \
      --langdata_dir ~/projects/langdata \
      --tessdata_dir ~/projects/tesseract/tessdata \
      --fontlist "Verdana" \
      --output_dir ~/tesstutorial/evalplusminus
    

    4 - Combine trained data files

    ~/projects/tesseract/training/combine_tessdata \
      -e ~/projects/tesseract/tessdata/eng.traineddata \
      ~/tesstutorial/trainplusminus/eng.lstm
    

    5 - Fine tuning

    ~/projects/tesseract/training/lstmtraining \
      --model_output ~/tesstutorial/trainplusminus/plusminus \
      --continue_from ~/tesstutorial/trainplusminus/eng.lstm \
      --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
      --old_traineddata ~/projects/tesseract/tessdata/eng.traineddata \
      --train_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt \
      --max_iterations 3600
    

    6 - Test the result on other fonts

    ~/projects/tesseract/training/lstmeval \
      --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
      --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
      --eval_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt
    

    7 - Test the result on the main font

    ~/projects/tesseract/training/lstmeval \
      --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
      --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
      --eval_listfile ~/tesstutorial/evalplusminus/eng.training_files.txt
    
  • Some programs can't find OCR text in Tesseract's PDFs (3.04)

    While Acrobat XI can find text in a PDF, it appears that poppler's pdftotext program, OS X's Preview app, and the library PyPDF2's extractText() function all fail to locate text. It seems that Tesseract is encoding text in a way that makes it inaccessible to many PDF viewers.

    pdftotext produces empty output. Preview app allows highlighting of text in the appropriate locations, but it cannot be copied to the clipboard or searched. PyPDF2 extractText also produces an empty string as text.

  • Tesseract v5: hocr_font_info 1 dont return font name (x_font)

    Hi,

    I'm trying to run the latest version of Tesseract with hocr output and hocr_font_info enabled, to obtain the name and size of the font.

    This is how I call tesseract:

    pytesseract.pytesseract.run_tesseract(IMA_PATH, 'output_hocr', extension='jpg',config="hocr --tessedit_create_hocr 1 --hocr_font_info 1 --oem 0")

    With the input image: font1

    And this is the output:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
     <head>
      <title></title>
      <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
      <meta name='ocr-system' content='tesseract v5.0.1.20220118' />
      <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
     </head>
     <body>
      <div class='ocr_page' id='page_1' title='image "./images/fonts/font1.JPG"; bbox 0 0 393 177; ppageno 0; scan_res 96 96'>
       <div class='ocr_carea' id='block_1_1' title="bbox 15 19 365 156">
        <p class='ocr_par' id='par_1_1' lang='ocr_tess_v1' title="bbox 15 19 365 156">
         <span class='ocr_line' id='line_1_1' title="bbox 17 19 210 60; baseline 0 -1; x_size 54.08889; x_descenders 13.522223; x_ascenders 13.522223">
          <span class='ocrx_word' id='word_1_1' title='bbox 17 19 210 60; x_wconf 90'>FUENTE</span>
         </span>
         <span class='ocr_line' id='line_1_2' title="bbox 15 114 365 156; baseline 0 -1; x_size 54.08889; x_descenders 13.522223; x_ascenders 13.522223">
          <span class='ocrx_word' id='word_1_2' title='bbox 15 114 202 156; x_wconf 44'>CALIBRI</span>
          <span class='ocrx_word' id='word_1_3' title='bbox 226 114 365 156; x_wconf 44'>BODY</span>
         </span>
        </p>
       </div>
      </div>
     </body>
    </html>
    

    As you can see, the font name (x_font) is not returned. I read in other issues (https://github.com/tesseract-ocr/tesseract/issues/684) that this problem could be caused by the Tesseract version.

    Is it possible to obtain the font name in the latest Tesseract version?

     print("Tesseract version: ", pytesseract.get_tesseract_version())
             Tesseract version:  5.0.1.20220118
    

    Thanks in advance!!
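
To check programmatically whether a given hOCR file carries font information, the `title` attribute of each `ocrx_word` span can be split into its properties. The sketch below uses only the Python standard library. The names `parse_title`, `HocrWords` and `words_from_hocr` are hypothetical, and it assumes that, when font information is present, it appears as an `x_font` property next to `bbox` and `x_wconf`, as the issue expects.

```python
from html.parser import HTMLParser


def parse_title(title):
    """Split an hOCR title such as 'bbox 1 2 3 4; x_wconf 90' into a dict."""
    props = {}
    for part in title.split(";"):
        key, _, value = part.strip().partition(" ")
        if key:
            props[key] = value.strip()
    return props


class HocrWords(HTMLParser):
    """Collect the text and title properties of every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "span" and attributes.get("class") == "ocrx_word":
            self._current = {"text": "", "props": parse_title(attributes.get("title", ""))}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # Assumes word spans do not contain nested spans.
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


def words_from_hocr(hocr_text):
    parser = HocrWords()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# One line taken from the output quoted in this issue.
sample = (
    "<span class='ocr_line' id='line_1_1' title=\"bbox 17 19 210 60\">"
    "<span class='ocrx_word' id='word_1_1' title='bbox 17 19 210 60; x_wconf 90'>FUENTE</span>"
    "</span>"
)
for word in words_from_hocr(sample):
    print(word["text"], word["props"].get("x_font", "(no x_font)"))
```

For the output quoted in this issue, no word carries an `x_font` property, which matches what the reporter observed.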

  • Text2Image isn't working properly

    I'm trying to retrain this Tesseract Engine (https://gitlab.com/pninim.org/tessdata_heb_rashi/-/blob/main/tesseract_4.1.1/TRAINING.md) for a specific obscure Hebrew script, for Tesseract 5. Using the command listed there, I try to get a list of available fonts with text2image --list_available_fonts --fonts_dir FontsRashi/Working, which initially worked but has ceased to do so.

    Environment

    • Tesseract Version: 5.0.0
    • Commit Number:
    • Platform: 64 Bit Fedora 35

    Current Behavior: Displays (process:98484): Pango-CRITICAL **: 23:45:52.231: pango_font_description_set_size: assertion 'size >= 0' failed followed by what seems like a list of fonts installed on the system.

    Expected Behavior: List the Fonts available in a directory

    Suggested Fix: No idea. I need help troubleshooting this issue. Expected behavior was demonstrated until very recently despite the fact that I seem to be using the same install since I built from source (I don't remember the commit used)

    Below are some photos relevant to the error.

  • Different results on debian machines compared to windows & mac

    Environment

    • Tesseract Version: 5.0.1
    • Commit Number: 424b17f997363670d187f42c43408c472fe55053
    • Platform: Linux girid 4.19.0-20-amd64 # 1 SMP Debian 4.19.235-1 x86_64 GNU/Linux

    Current Behavior:

    I am using the following version of tesseract on a Debian machine.

    $ tesseract --version
    tesseract 5.0.1-42-g424b
     leptonica-1.83.0
      libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.1.0 : zlib 1.2.11
     Found AVX2
     Found AVX
     Found FMA
     Found SSE4.1

    I am trying to match its accuracy across the three platforms (Windows, Mac, and Debian) for my application. However, I noticed that Debian produces different results compared to the other two platforms.

    I have attached a sample image for which the results differ. I tried disabling the optimizations related to AVX2, AVX, FMA and SSE, and that did not help.

    Expected Behavior:

    Ideally, the models should produce the same result on the same image across platforms.

    longScannedDoc

  • Question: Can anybody point me to code that handles splitting word into characters?

    I have a hard time navigating the code.

    I located PAGE_RES_IT::ReplaceCurrentWord that replaces fake boxes with real ones, but I can't locate where src_b and rej_b are assembled.

    I have some boxes wrong, most likely related to diplopia, and I am looking for the cause.

  • Different OCR results for the same image - file from disk vs from memory

    Environment

    • Tesseract Version: 5.1.0
    • Platform: Windows 32-bit, compiled under MSVC 2017

    Current Behavior:

    I am writing an application where I can select different pieces of text in the preview. For performance reasons, I keep the selected region in memory and then OCR it. The problem is that in some cases I get spurious characters, such as underscores or semicolons, that are not visible in the image (see the example below for file.png). I am using the German trained model with PSM 1 = Automatic page segmentation with OSD.

    See: file.png

    1. OCR from load from disk

    image

    2. OCR from memory:

    image

    Of course, there are more differences - I did a few tests - the results depend on how I select the text see below:

    image

    I am using the C API and writing the code in C++. I am using the TessBaseAPISetImage functions. I don't know where the difference comes from. Do you have any ideas what I could do to get the same OCR results from memory as from a file?

    image

    Expected Behavior:

    I would expect OCR to return the same, correct results whether the image is read from disk or from memory.

    Suggested Fix:

    Unifying OCR from memory and file.
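
When an image region is handed to TessBaseAPISetImage from memory, the call takes the buffer pointer, width, height, bytes per pixel, and bytes per line. One thing worth checking in a case like the one above (an assumption, not a confirmed diagnosis) is whether bytes per line reflects the stride of the full buffer, including any row padding, rather than the width of the selection. The sketch below computes those arguments for a sub-rectangle. RasterView, PaddedStride, and SubView are hypothetical helpers, not part of Tesseract.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of a raster held in memory.
struct RasterView {
    const unsigned char* data;  // first byte of the top-left pixel
    int width;                  // pixels per row
    int height;                 // number of rows
    int bytes_per_pixel;        // e.g. 1 (gray), 3 (RGB), 4 (RGBA)
    int bytes_per_line;         // stride: bytes from one row start to the next
};

// Stride of a row padded up to a multiple of 4 bytes, as used by some
// bitmap formats. Assumption: the source buffer follows this convention.
inline int PaddedStride(int width, int bytes_per_pixel) {
    return (width * bytes_per_pixel + 3) / 4 * 4;
}

// Returns a view of a sub-rectangle that shares the parent's memory.
// The stride stays that of the full image; only the start pointer,
// width, and height change.
inline RasterView SubView(const RasterView& full, int left, int top,
                          int width, int height) {
    assert(left >= 0 && top >= 0 && width > 0 && height > 0);
    assert(left + width <= full.width && top + height <= full.height);
    const std::size_t offset =
        static_cast<std::size_t>(top) * full.bytes_per_line +
        static_cast<std::size_t>(left) * full.bytes_per_pixel;
    return RasterView{full.data + offset, width, height,
                      full.bytes_per_pixel, full.bytes_per_line};
}
```

The resulting fields map onto TessBaseAPISetImage(handle, view.data, view.width, view.height, view.bytes_per_pixel, view.bytes_per_line). An image loaded from a file may also carry resolution metadata that a raw buffer lacks, so calling TessBaseAPISetSourceResolution after setting the image is another difference that may be worth ruling out.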

  • generate pdf error with tesseract took too long to OCR - skipping

    Environment

    • Tesseract Version: 5.1.0
    • Platform: macOS Monterey 12.2.1 with Intel chip (xnu-8019.80.24~20/RELEASE_X86_64 x86_64)

    Current Behavior: OCR fails on a 50-page document. It fails on page 29 with an "Exit Code Exception" error. Other errors were reported before the final one: [tesseract] Error in pixCreateHeader: requested w=199558, h=27677, d=32

    Expected Behavior: OCR the PDF and generate a text file.

    Command: ocrmypdf --sidecar output.txt doc.pdf output.pdf --force-ocr --max-image-mpixels 551306766 -v

    Suggested Fix:

    Screenshot 2022-05-05 at 18 52 21 Screenshot 2022-05-05 at 18 53 06
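
The dimensions in the pixCreateHeader error above give a sense of scale: at 32 bits per pixel, a 199558 x 27677 raster needs roughly 22 GB uncompressed. The sketch below does that arithmetic. The 2^31 - 1 byte limit used for comparison is an assumption about the imaging library's allocation cap, not something stated in the report, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed cap on a single raster allocation (2^31 - 1 bytes).
// This value is an assumption for illustration, not taken from the report.
constexpr std::uint64_t kAssumedLimitBytes = (1ULL << 31) - 1;

// Bytes needed for an uncompressed raster whose rows are padded to
// whole 32-bit words.
inline std::uint64_t RasterBytes(std::uint64_t width, std::uint64_t height,
                                 std::uint64_t depth_bits) {
    assert(depth_bits > 0);
    const std::uint64_t words_per_line = (width * depth_bits + 31) / 32;
    return 4 * words_per_line * height;
}

// True if the raster fits within the given byte limit.
inline bool FitsWithin(std::uint64_t width, std::uint64_t height,
                       std::uint64_t depth_bits, std::uint64_t limit_bytes) {
    return RasterBytes(width, height, depth_bits) <= limit_bytes;
}
```

With the reported values, RasterBytes(199558, 27677, 32) comes to 22,092,667,064 bytes, far above the assumed cap, whereas an A4 page at 300 DPI (2480 x 3508) needs about 35 MB. Rasterizing the offending page at a lower resolution before OCR would be one way to bring it under such a limit.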

Dec 31, 2021
Discovery is an open-source Discord Bot with the main features Tickets, Moderation, Giveaways and Reaction roles.

Dec 29, 2021
(@Tablada32BOT is my bot on Twitter) This is a simple bot; its main and only function is to reply to tweets that mention the bot with its @

Remember: if you are going to host your Twitter bot on a page where others can read your code, I recommend that you create an .env file and put your twit

Jun 4, 2021
A Telegram Bot for adding Footer caption beside main caption of Telegram Channel Messages.

Footer-Bot A Telegram Bot for adding Footer caption beside main caption of Telegram Channel Messages. Best for Telegram Movie Channels. Made by @AbirH

May 1, 2022
The Main Pythonic Version Of Twig Using Nextcord

Mar 21, 2022
thumbor is an open-source photo thumbnail service by globo.com

Survey If you use thumbor, please take 1 minute and answer this survey? It's only 2 questions and one is multiple choice!!! thumbor is a smart imaging

May 15, 2022
Free and Open Source Machine Translation API. 100% self-hosted, no limits, no ties to proprietary services. Built on top of Argos Translate.

LibreTranslate Try it online! | API Docs Free and Open Source Machine Translation API, entirely self-hosted. Unlike other APIs, it doesn't rely on pro

May 21, 2022
The first open-source PyTgCalls-based project.

SU Music Player — The first open-source PyTgCalls based Pyrogram bot to play music in voice chats Requirements FFmpeg NodeJS 15+ Python 3.7+ Deploymen

May 15, 2022
The best (and now open source) Discord selfbot.

React Selfbot Yes, for real Why am I making this open source? Because can't stop calling my product a rat, tokenlogger and what else not. But there is

Apr 22, 2022
An Open-Source Discord bot created to provide basic functionality which should be in every discord guild. We use this same bot with additional configurations for our guilds.

A Discord bot completely written to be taken from the source and built according to your own custom needs. This bot supports some core features and is

Jan 11, 2022
🎀 First and most powerful open source clicktune botter

CTB Follow me here: Discord | YouTube | Twitter | Github Features: The first, Fast, Proxy support: http/s, socks4/5, premium (w

Mar 14, 2022
Bifrost C2. Open-source post-exploitation using Discord API

Bifrost Command and Control What's Bifrost? Bifrost is an open-source Discord BOT that works as Command and Control (C2). This C2 uses Discord API for

May 13, 2022
PyLyrics Is An [Open-Source] Bot That Can Help You Get Song Lyrics

PyLyrics-Bot Telegram Bot To Search Song Lyrics From Genius. Demo: Deploy: ❤ Deploy Your Own Bot: Star, Fork & Deploy -Easy Way -Self-h

Apr 29, 2022
Maestral is an open-source Dropbox client written in Python.

Maestral - A light-weight and open-source Dropbox client for macOS and Linux

May 16, 2022
This is a Innexia Chat Bot Open Source Code 🤬

⚡ Innexia ⚡ A Powerful, Smart And Simple Chat Bot ... Written with Python... Available on Telegram as @InnexiaChatBot ❤️ Support ⭐️ Thanks to everyone

Sep 19, 2021
The open source version of Tentro - A multipurpose Discord bot.

Welcome to Tentro A multipurpose Discord bot. Homepage Install pip install -r requirements.txt Usage py Tentro.py Contributors Tentro Dev Tea

Feb 20, 2022
A free and open-source discord webhook spammer.

Discord-Webhook-Spammer A free and open-source discord webhook spammer. Usage Depending on your python installation your commands may vary. Below are

Sep 8, 2021
An open source API to validate the EU Covid Certificates / Green Certificates

Open Covid Certificate Validator This is an open source API to validate EU Digital COVID Certificates. It receives a COVID certificate and validates it u

Apr 28, 2022
This is a open source discord bot project

pythonDiscordBot This is a open source discord bot project #based on the MAX A video: https://www.youtube.com/watch?v=jHZlvRr9KxM Prerequisites Python

Oct 11, 2021