pdfsandwich - Online in the Cloud

NAME


pdfsandwich - A generator for sandwich OCR pdfs from scanned pdf files

SYNOPSIS


pdfsandwich [options] inputfile.pdf

DESCRIPTION


pdfsandwich generates "sandwich" OCR pdf files: pdf files which contain only images
(no text) are processed by optical character recognition (OCR), and the recognized text
is added to each page invisibly "behind" the images. pdfsandwich needs the following
programs: unpaper, convert, gs, tesseract, and, for tesseract < 3.03, hocr2pdf. As
tesseract >= 3.03 can write pdf files itself, hocr2pdf is only needed for older versions
of tesseract. For more information, visit http://www.tobias-elze.de/pdfsandwich.
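
As a rough illustration of the dependency list above, the following Python sketch checks whether the helper programs can be found on PATH. The helper name missing_programs is hypothetical and not part of pdfsandwich; the program names are the defaults given in this manual.

```python
import shutil

# Helper programs named in the description. hocr2pdf is kept separate
# because it is only needed with tesseract < 3.03.
REQUIRED = ["unpaper", "convert", "gs", "tesseract"]
OPTIONAL = ["hocr2pdf"]


def missing_programs(names, which=shutil.which):
    """Return the names that cannot be found on PATH (hypothetical helper)."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    for name in missing_programs(REQUIRED):
        print(f"missing required program: {name}")
    for name in missing_programs(OPTIONAL):
        print(f"missing program (only needed for tesseract < 3.03): {name}")
```

If binaries are installed under other names, the options -convert, -gs, -tesseract, -unpaper and -hocr2pdf described below point pdfsandwich at them.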

OPTIONS


-convert
-convert filename : name of convert binary (default: convert)

-coo
-coo options : additional convert options; make sure to quote, e.g. -coo
"-normalize -black-threshold 75%"; call convert --help or man convert for all
convert options
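
The quoting requirement means that the whole option string must reach pdfsandwich as a single argument. A minimal Python sketch, assuming the command is assembled as an argument list (build_command is a hypothetical helper, and scan.pdf is a placeholder file name):

```python
import shlex


def build_command(input_pdf, convert_options=None):
    """Build an argument list for pdfsandwich (hypothetical helper)."""
    cmd = ["pdfsandwich"]
    if convert_options:
        # The complete option string is ONE argument following -coo.
        cmd += ["-coo", convert_options]
    cmd.append(input_pdf)
    return cmd


cmd = build_command("scan.pdf", "-normalize -black-threshold 75%")
# shlex.join shows how the same command would be quoted in a POSIX shell.
print(shlex.join(cmd))
# prints: pdfsandwich -coo '-normalize -black-threshold 75%' scan.pdf
```

Passing the list to subprocess.run without shell=True avoids a second round of shell quoting. The same pattern applies to -hoo, -tesso and -unpo.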

-debug
keep all temporary files in /tmp (for debugging)

-enforcehocr2pdf
use hocr2pdf even if tesseract >= 3.03

-first_page
-first_page number : number of page to start OCR from (default: 1)

-grayfilter
enable unpaper's gray filter; further options can be set by -unpo

-gs
-gs filename : name of gs binary (default: gs)

-hocr2pdf
-hocr2pdf filename : name of hocr2pdf binary (default: hocr2pdf); ignored for
tesseract >= 3.03 unless option -enforcehocr2pdf is set
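
The rule for when hocr2pdf comes into play can be sketched as a small Python function. This only illustrates the documented rule and is not pdfsandwich's actual code; it assumes a purely numeric dotted version string, so real tesseract version output may need extra parsing first.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '3.03' into a tuple: (3, 3)."""
    return tuple(int(part) for part in text.split("."))


def uses_hocr2pdf(tesseract_version, enforce_hocr2pdf=False):
    """Return True if hocr2pdf would be used according to the rule above.

    hocr2pdf is used for tesseract < 3.03, or for any version when
    -enforcehocr2pdf is set.
    """
    return enforce_hocr2pdf or parse_version(tesseract_version) < (3, 3)


print(uses_hocr2pdf("3.02"))                         # True
print(uses_hocr2pdf("3.03"))                         # False
print(uses_hocr2pdf("4.1.1", enforce_hocr2pdf=True)) # True
```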

-hoo
-hoo options : additional hocr2pdf options; make sure to quote

-identify
-identify filename : name of identify binary (default: identify)

-last_page
-last_page number : number of page up to which to process OCR (default: number of
pages in inputfile)

-lang
-lang language : language of the text; option to tesseract (default: eng), e.g. eng,
deu, deu-frak, fra, rus, swe, spa, ita, ...; see option -list_langs. Multiple
languages may be specified, separated by plus characters.

-layout
-layout { single | double | none } : layout of the scanned pages; requires unpaper
single: one page per sheet
double: two pages per sheet
none: no auto-layout (default)

-list_langs
list currently available languages and exit; in case of custom binaries of
tesseract, place this after the -tesseract option

-maxpixels
-maxpixels NUM : maximal number of pixels allowed for a page of the input file; if
(resolution/72)^2 * width * height > maxpixels, the page is scaled down prior to OCR
so that its size in pixels corresponds to maxpixels; default: 17415167 (A3 @ 300 dpi)
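
The -maxpixels condition can be worked through in a few lines of Python. This sketch makes two assumptions that the manual does not state: width and height are given in PDF points (1/72 inch), as the factor resolution/72 suggests, and the page is scaled by the same linear factor in both directions. The function names are hypothetical.

```python
import math

DEFAULT_MAXPIXELS = 17415167  # default value stated in this manual


def page_pixels(width_pt, height_pt, resolution=300):
    """Pixel count of a page rendered at the given resolution (dpi)."""
    return (resolution / 72) ** 2 * width_pt * height_pt


def downscale_factor(width_pt, height_pt, resolution=300,
                     maxpixels=DEFAULT_MAXPIXELS):
    """Linear scale factor that brings the page down to maxpixels.

    Returns 1.0 when the page is already within the limit. Assumes the
    same factor is applied to width and height, so the pixel count
    shrinks by the square of the factor.
    """
    pixels = page_pixels(width_pt, height_pt, resolution)
    if pixels <= maxpixels:
        return 1.0
    return math.sqrt(maxpixels / pixels)


# A 100 x 100 pt page at 72 dpi has 10000 pixels; with a limit of 2500
# pixels, each side has to be halved.
print(downscale_factor(100, 100, resolution=72, maxpixels=2500))  # 0.5
```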

-noimage
do not place the image over the text (requires hocr2pdf; ignored without
-enforcehocr2pdf option)

-nopreproc
do not preprocess with unpaper

-nthreads
-nthreads number : number of parallel threads (default: guessed number of CPUs; if
guessing fails: 1)
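
The documented default for -nthreads (guessed number of CPUs, else 1) corresponds to a fallback like the one below. This is a sketch of the documented behavior in Python, not the detection method pdfsandwich itself uses; default_nthreads is a hypothetical name.

```python
import os


def default_nthreads(cpu_count=os.cpu_count):
    """Return the guessed CPU count, or 1 if guessing fails."""
    n = cpu_count()  # os.cpu_count() returns None when undetermined
    return n if n and n > 0 else 1


print(default_nthreads())
```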

-o
-o filename : output file; default: inputfile_ocr.pdf (if extension is different
from .pdf, original extension is kept)
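
The default output name can be illustrated with a short Python sketch. It follows the rule as worded above: _ocr is inserted before the extension and the extension is kept. How pdfsandwich treats input names without any extension is not documented here, so that case is an assumption; default_output_name is a hypothetical helper.

```python
import os.path


def default_output_name(input_path):
    """Insert '_ocr' before the file extension, keeping the extension."""
    root, ext = os.path.splitext(input_path)
    # Assumption: a name without an extension simply gets '_ocr' appended.
    return f"{root}_ocr{ext}"


print(default_output_name("scan.pdf"))  # scan_ocr.pdf
print(default_output_name("scan.xyz"))  # scan_ocr.xyz
```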

-pagesize
-pagesize { original | NUMxNUM } : set page size of output pdf
original: same as input file (default)
NUMxNUM: width x height in pixels (e.g. for A4: -pagesize 595x842)
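
The two accepted forms of the -pagesize value can be told apart as in the following Python sketch. parse_pagesize is a hypothetical helper for validating a value before it is passed on; it does not reproduce pdfsandwich's own argument parsing.

```python
import re


def parse_pagesize(value):
    """Parse a -pagesize value.

    Returns None for 'original' and a (width, height) tuple of ints
    for the NUMxNUM form; raises ValueError for anything else.
    """
    if value == "original":
        return None
    match = re.fullmatch(r"(\d+)x(\d+)", value)
    if match is None:
        raise ValueError(f"expected 'original' or NUMxNUM, got {value!r}")
    return int(match.group(1)), int(match.group(2))


print(parse_pagesize("595x842"))  # (595, 842)
```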

-resolution
-resolution NUM : resolution (dpi) used for OCR (default: 300)

-rgb
use RGB color space for images (default: black and white); use with care: causes
problems with some color spaces

-sloppy_text
sloppily place text, group words, do not draw single glyphs; ignored for tesseract
>= 3.03 unless option -enforcehocr2pdf is set

-tesseract
-tesseract filename : name of tesseract binary (default: tesseract)

-tesso
-tesso options : additional tesseract options; make sure to quote

-unpaper
-unpaper filename : name of unpaper binary (default: unpaper)

-unpo
-unpo options : additional unpaper options; make sure to quote

-quiet
suppress output

-verbose
produce more output

-version
print version and quit

-help
Display this list of options

--help
Display this list of options

LANGUAGES


Via Tesseract, numerous language packages are available; see
http://code.google.com/p/tesseract-ocr/downloads/list for a complete list. Here is an
incomplete selection of supported languages and their abbreviations:

ara (Arabic), aze (Azerbaijani), bul (Bulgarian), cat (Catalan), ces (Czech), chi_sim
(Simplified Chinese), chi_tra (Traditional Chinese), chr (Cherokee), dan (Danish),
dan-frak (Danish (Fraktur)), deu (German), ell (Greek), eng (English), enm (Middle
English), epo (Esperanto), est (Estonian), fin (Finnish), fra (French), frm (Middle
French), glg (Galician), heb (Hebrew), hin (Hindi), hrv (Croatian), hun (Hungarian),
ind (Indonesian),
ita (Italian), jpn (Japanese), kor (Korean), lav (Latvian), lit (Lithuanian), nld (Dutch),
nor (Norwegian), pol (Polish), por (Portuguese), ron (Romanian), rus (Russian), slk
(Slovakian), slv (Slovenian), sqi (Albanian), spa (Spanish), srp (Serbian), swe (Swedish),
tam (Tamil), tel (Telugu), tgl (Tagalog), tha (Thai), tur (Turkish), ukr (Ukrainian), vie
(Vietnamese)

Multiple languages may be specified, separated by plus characters. Note that the
respective tesseract language package needs to be installed on your system to be usable by
pdfsandwich. Option -list_langs lists the languages which are available on your system.
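
Combining several languages into one -lang value is a matter of joining the abbreviations with plus characters. The Python sketch below also checks them against a set of installed languages, which is assumed to be supplied by the caller (for example, copied by hand from the output of -list_langs); lang_argument is a hypothetical helper.

```python
def lang_argument(languages, available=None):
    """Join language abbreviations with '+' for use with -lang.

    If a collection of available languages is given, raise ValueError
    for any requested language that is not in it.
    """
    languages = list(languages)
    if not languages:
        raise ValueError("at least one language is required")
    if available is not None:
        missing = [lang for lang in languages if lang not in available]
        if missing:
            raise ValueError("not installed: " + ", ".join(missing))
    return "+".join(languages)


print(lang_argument(["deu", "eng"]))  # deu+eng
```

The result would be passed as the value of -lang, for example -lang deu+eng.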

AVAILABILITY


Sources and packages as well as comprehensive help can be found at
http://www.tobias-elze.de/pdfsandwich.
