enconv - Online in the Cloud

This is the command enconv that can be run in the OnWorks free hosting provider using one of our multiple free online workstations such as Ubuntu Online, Fedora Online, Windows online emulator or MAC OS online emulator

PROGRAM:

NAME


enca -- detect and convert encoding of text files

SYNOPSIS


enca [-L LANGUAGE] [OPTION]... [FILE]...
enconv [-L LANGUAGE] [OPTION]... [FILE]...

INTRODUCTION AND EXAMPLES


If you are lucky enough, the only two things you will ever need to know are: command

enca FILE

will tell you which encoding file FILE uses (without changing it), and

enconv FILE

will convert file FILE to your locale native encoding. To convert the file to some other
encoding use the -x option (see -x entry in section OPTIONS and sections CONVERSION and
ENCODINGS for details).

Both work with multiple files and standard input (output) too. E.g.

enca -x latin2 <sometext | lpr

assures file `sometext' is in ISO Latin 2 when it's sent to printer.

The main reason why these command will fail and turn your files into garbage is that Enca
needs to know their language to detect the encoding. It tries to determine your language
and preferred charset from locale settings, which might not be what you want.

You can (or have to) use -L option to tell it the right language. Suppose, you downloaded
some Russian HTML file, `file.htm', it claims it's windows-1251 but it isn't. So you run

enca -L ru file.htm

and find out it's KOI8-R (for example). Be warned, currently there are not many supported
languages (see section LANGUAGES).

Another warning concerns the fact several Enca's features, namely its charset conversion
capabilities, strongly depend on what other tools are installed on your system (see
section CONVERSION)--run

enca --version

to get list of features (see section FEATURES). Also try

enca --help

to get description of all other Enca options (and to find the rest of this manual page
redundant).

DESCRIPTION


Enca reads given text files, or standard input when none are given, and uses knowledge
about their language (must be supported by you) and a mixture of parsing, statistical
analysis, guessing and black magic to determine their encodings, which it then prints to
standard output (or it confesses it doesn't have any idea what the encoding could be). By
default, Enca presents results as a multiline human-readable descriptions, several other
formats are available--see Output type selectors below.

Enca can also convert files to some other encoding ENC when you ask for it--either using a
built-in converter, some conversion library, or by calling an external converter.

Enca's primary goal is to be usable unattended, as an automatic conversion tool, though it
perhaps have not reached this point yet (please see section SECURITY).

Please note except rare cases Enca really has to know the language of input files to give
you a reliable answer. On the other hand, it can then cope quite well with files that are
not purely textual or even detect charset of text strings inside some binary file; of
course, it depends on the character of the non-text component.

Enca doesn't care about structure of input files, it views them as a uniform piece of
text/data. In case of multipart files (e.g. mailboxes), you have to use some tool knowing
the structure to extract the individual parts first. It's the cost of ability to detect
encodings of any damaged, incomplete or otherwise incorrect files.

OPTIONS


There are several categories of options: operation mode options, output type selectors,
guessing parameters, conversion parameters, general options and listings.

All long options can be abbreviated as long as they are unambiguous, mandatory parameters
of long options are mandatory for short options too.

Operation modes
are following:

-c, --auto-convert
Equivalent to calling Enca as enconv.

If no output type selector is specified, detect file encodings, guess your
preferred charset from locales, and convert files to it (only available with
+target-charset-auto feature).

-g, --guess
Equivalent to calling Enca as enca.

If no output type selector is specified, detect file encodings and report them.

Output type selectors
select what action Enca will take when it determines the encoding; most of them just
choose between different names, formats and conventions how encodings can be printed, but
one of them (-x) is special: it tells Enca to recode files to some other encoding ENC.
These options are mutually exclusive; if you specify more than one output type selector
the last one takes precedence.

Several output types represent charset name used by some other program, but not all these
programs know all the charsets which Enca recognises. Be warned, Enca makes no difference
between unrecognised charset and charset having no name in given namespace in such
situations.

-d, --details
It used to print a few pages of details about the guessing process, but since Enca
is just a program linked against Enca library, this is not possible and this option
is roughly equivalent to --human-readable, except it reports failure reason when
Enca doesn't recognize the encoding.

-e, --enca-name
Prints Enca's nice name of the charset, i.e., perhaps the most generally accepted
and more or less human-readable charset identifier, with surfaces appended.

This name is used when calling an external converter, too.

-f, --human-readable
Prints verbal description of the detected charset and surfaces--something a human
understands best. This is the default behaviour.

The precise format is following: the first line contains charset name alone, and
it's followed by zero or more indented lines containing names of detected surfaces.
This format is not, however, suitable or intended for further machine-processing,
and the verbal charset descriptions are like to change in the future.

-i, --iconv-name
Prints how iconv(3) (and/or iconv(1)) calls the detected charset. More precisely,
it prints one, more or less arbitrarily chosen, alias accepted by iconv. A charset
unknown to iconv counts as unknown.

This output type makes sense only when Enca is compiled with iconv support (feature
+iconv-interface).

-r, --rfc1345-name
Prints RFC 1345 charset name. When such a name doesn't exist because RFC 1345
doesn't define a given encoding, some other name defined in some other RFC or just
the name which author considers `the most canonical', is printed.

Since RFC 1345 doesn't define surfaces, no surface info is appended.

-m, --mime-name
Prints preferred MIME name of detected charset. This is the name you should
normally use when fixing e-mails or web pages.

A charset not present in http://www.iana.org/assignments/character-sets counts as
unknown.

-s, --cstocs-name
Prints how cstocs(1) calls the detected charset. A charset unknown to cstocs
counts as unknown.

-n, --name=WORD
Prints charset (encoding) name selected by WORD (can be abbreviated as long as is
unambiguous). For names listed above, --name=WORD is equivalent to --WORD.

Using aliases as the output type causes Enca to print list of all accepted aliases
of detected charset.

-x, --convert-to=[..]ENC
Converts file to encoding ENC.

The optional `..' before encoding name has no special meaning, except you can use
it to remind yourself that, unlike in recode(1), you should specify desired
encoding, instead of current.

You can use recode(1) recoding chains or any other kind of braindead recoding
specification for ENC, provided that you tell Enca to use some tool understanding
it for conversion (see section CONVERSION).

When Enca fails to determine the encoding, it prints a warning and leaves the the
file as is; when it is run as a filter it tries to do its best to copy standard
input to standard output unchanged. Nevertheless, you should not rely on it and do
backup.

Guessing parameters
There's only one: -L setting language of input files. This option is mandatory (but see
below).

-L, --language=LANG
Sets language of input files to LANG.

More precisely, LANG can be any valid locale name (or alias with +locale-alias
feature) of some supported language. You can also specify `none' as language name,
only multibyte encodings are recognised then. Run

enca --list languages

to get list of supported languages. When you don't specify any language Enca tries
to guess your language from locale settings and assumes input files use this
language. See section LANGUAGES for details.

Conversion parameters
give you finer control of how charset conversion will be performed. They don't affect
anything when -x is not specified as output type. Please see section CONVERSION for the
gory conversion details.

-C, --try-converters=LIST
Appends comma separated LIST to the list of converters that will be tried when you
ask for conversion. Their names can be abbreviated as long as they are
unambiguous. Run

enca --list converters

to get list of all valid converter names (and see section CONVERSION for their
description).

The default list depends on how Enca has been compiled, run

enca --help

to find out default converter list.

Note the default list is used only when you don't specify -C at all. Otherwise,
the list is built as if it were initially empty and every -C adds new converter(s)
to it. Moreover, specifying none as converter name causes clearing the converter
list.

-E, --external-converter-program=PATH
Sets external converter program name to PATH. Default external converter depends
on how enca has been complied, and the possibility to use external converters may
not be available at all. Run

enca --help

to find out default converter program in your enca build.

General options
don't fit to other option categories...

-p, --with-filename
Forces Enca to prefix each result with corresponding file name. By default, Enca
prefixes results with filenames when run on multiple files.

Standard input is printed as STDIN and standard output as STDOUT (the latter can be
probably seen in error messages only).

-P, --no-filename
Forces Enca to not prefix results with file names. By default, Enca doesn't prefix
result with file name when run on a single file (including standard input).

-V, --verbose
Increases verbosity level (each use increases it by one).

Currently this option in not very useful because different parts of Enca respond
differently to the same verbosity level, mostly not at all.

Listings
are all terminal, i.e. when Enca encounters some of them it prints the required listing
and terminates without processing any following options.

-h, --help
Prints brief usage help.

-G, --license
Prints full Enca license (through a pager, if possible).

-l, --list=WORD
Prints list specified by WORD (can be abbreviated as long as it is unambiguous).
Available lists include:

built-in-charsets. All encodings convertible by built-in converter, by group (both
input and output encoding must be from this list and belong to the same group for
internal conversion).

built-in-encodings. Equivalent to built-in-charsets, but considered obsolete; will
be accepted with a warning, for a while.

converters. All valid converter names (to be used with -C).

charsets. All encodings (charsets). You can select what names will be printed
with --name or any name output type selector (of course, only encodings having a
name in given namespace will be printed then), the selector must be specified
before --list.

encodings. Equivalent to charsets, but considered obsolete; will be accepted with
a warning, for a while.

languages. All supported languages together with charsets belonging to them. Note
output type selects language name style, not charset name style here.

names. All possible values of --name option.

lists. All possible values of this option. (Crazy?)

surfaces. All surfaces Enca recognises.

-v, --version
Prints program version and list of features (see section FEATURES).

CONVERSION


Though Enca has been originally designed as a tool for guessing encoding only, it now
features several methods of charset conversion. You can control which of them will be
used with -C.

Enca sequentially tries converters from the list specified by -C until it finds some that
is able to perform required conversion or until it exhausts the list. You should specify
preferred converters first, less preferred later. External converter (extern) should be
always specified last, only as last resort, since it's usually not possible to recover
when it fails. The default list of converters always starts with built-in and then
continues with the first one available from: librecode, iconv, nothing.

It should be noted when Enca says it is not able to perform the conversion it only means
none of the converters is able to perform it. It can be still possible to perform the
required conversion in several steps, using several converters, but to figure out how,
human intelligence is probably needed.

Built-in converter
is the simplest and far the fastest of all, can perform only a few byte-to-byte
conversions and modifies files directly in place (may be considered dangerous, but is
pretty efficient). You can get list of all encodings it can convert with

enca --list built-in

Beside speed, its main advantage (and also disadvantage) is that it doesn't care: it
simply converts characters having a representation in target encoding, doesn't touch
anything else and never prints any error message.

This converter can be specified as built-in with -C.

Librecode converter
is an interface to GNU recode library, that does the actual recoding job. It may or may
not be compiled in; run

enca --version

to find out its availability in your enca build (feature +librecode-interface).

You should be familiar with recode(1) before using it, since recode is a quite
sophisticated and powerful charset conversion tool. You may run into problems using it
together with Enca particularly because Enca's support for surfaces not 100% compatible,
because recode tries too hard to make the transformation reversible, because it sometimes
silently ignores I/O errors, and because it's incredibly buggy. Please see GNU recode
info pages for details about recode library.

This converter can be specified as librecode with -C.

Iconv converter
is an interface to the UNIX98 iconv(3) conversion functions, that do the actual recoding
job. It may or may not be compiled in; run

enca --version

to find out its availability in your enca build (feature +iconv-interface).

While iconv is present on most today systems it only rarely offer some useful set of
available conversions, the only notable exception being iconv from GNU libc. It is
usually quite picky about surfaces, too (while, at the same time, not implementing surface
conversion). It however probably represents the only standard(ized) tool able to perform
conversion from/to Unicode. Please see iconv documentation about for details about its
capabilities on your particular system.

This converter can be specified as iconv with -C.

External converter
is an arbitrary external conversion tool that can be specified with -E option (at most one
can be defined simultaneously). There are some standard, provided together with enca:
cstocs, recode, map, umap, and piconv. All are wrapper scripts: for cstocs(1), recode(1),
map(1), umap(1), and piconv(1).

Please note enca has little control what the external converter really does. If you set
it to /bin/rm you are fully responsible for the consequences.

If you want to make your own converter to use with enca, you should know it is always
called

CONVERTER ENC_CURRENT ENC FILE [-]

where CONVERTER is what has been set by -E, ENC_CURRENT is detected encoding, ENC is what
has been specified with -x, and FILE is the file to convert, i.e. it is called for each
file separately. The optional fourth parameter, -, should cause (when present) sending
result of conversion to standard output instead of overwriting the file FILE. The
converter should also take care of not changing file permissions, returning error code 1
when it fails and cleaning its temporary files. Please see the standard external
converters for examples.

This converter can be specified as extern with -C.

Default target charset
The straightforward way of specifying target charset is the -x option, which overrides any
defaults. When Enca is called as enconv, default target charset is selected exactly the
same way as recode(1) does it.

If the DEFAULT_CHARSET environment variable is set, it's used as the target charset.

Otherwise, if you system provides the nl_langinfo(3) function, current locale's native
charset is used as the target charset.

When both methods fail, Enca complains and terminates.

Reversibility notes
If reversibility is crucial for you, you shouldn't use enca as converter at all (or maybe
you can, with very specifically designed recode(1) wrapper). Otherwise you should at
least know that there four basic means of handling inconvertible character entities:

fail--this is a possibility, too, and incidentally it's exactly what current GNU libc
iconv implementation does (recode can be also told to do it)

don't touch them--this is what enca internal converter always does and recode can do;
though it is not reversible, a human being is usually able to reconstruct the original (at
least in principle)

approximate them--this is what cstocs can do, and recode too, though differently; and the
best choice if you just want to make the accursed text readable

drop them out--this is what both recode and cstocs can do (cstocs can also replace these
characters by some fixed character instead of mere ignoring); useful when the
to-be-omitted characters contain only noise.

Please consult your favourite converter manual for details of this issue. Generally, if
you are not lucky enough to have all convertible characters in you file, manual
intervention is needed anyway.

Performance notes
Poor performance of available converters has been one of main reasons for including
built-in converter in enca. Try to use it whenever possible, i.e. when files in
consideration are charset-clean enough or charset-messy enough so that its zero built-in
intelligence doesn't matter. It requires no extra disk space nor extra memory and can
outperform recode(1) more than 10 times on large files and Perl version (i.e. the faster
one) of cstocs(1) more than 400 times on small files (in fact it's almost as fast as mere
cp(1)).

Try to avoid external converters when it's not absolutely necessary since all the forking
and moving stuff around is incredibly slow.

ENCODINGS


You can get list of recognised character sets with

enca --list charsets

and using --name parameter you can select any name you want to be used in the listing.
You can also list all surfaces with

enca --list surfaces

Encoding and surface names are case insensitive and non-alphanumeric characters are not
taken into account. However, non-alphanumeric characters are mostly not allowed at all.
The only allowed are: `-', `_', `.', `:', and `/' (as charset/surface separator). So
`ibm852' and `IBM-852' are the same, while `IBM 852' is not accepted.

Charsets
Following list of recognised charsets uses Enca's names (-e) and verbal descriptions as
reported by Enca (-f):

ASCII 7bit ASCII characters
ISO-8859-2 ISO 8859-2 standard; ISO Latin 2
ISO-8859-4 ISO 8859-4 standard; Latin 4
ISO-8859-5 ISO 8859-5 standard; ISO Cyrillic
ISO-8859-13 ISO 8859-13 standard; ISO Baltic; Latin 7
ISO-8859-16 ISO 8859-16 standard
CP1125 MS-Windows code page 1125
CP1250 MS-Windows code page 1250
CP1251 MS-Windows code page 1251
CP1257 MS-Windows code page 1257; WinBaltRim
IBM852 IBM/MS code page 852; PC (DOS) Latin 2
IBM855 IBM/MS code page 855
IBM775 IBM/MS code page 775
IBM866 IBM/MS code page 866
baltic ISO-IR-179; Baltic
KEYBCS2 Kamenicky encoding; KEYBCS2
macce Macintosh Central European

maccyr Macintosh Cyrillic
ECMA-113 Ecma Cyrillic; ECMA-113
KOI-8_CS_2 KOI8-CS2 code (`T602')
KOI8-R KOI8-R Cyrillic
KOI8-U KOI8-U Cyrillic
KOI8-UNI KOI8-Unified Cyrillic
TeX (La)TeX control sequences
UCS-2 Universal character set 2 bytes; UCS-2; BMP
UCS-4 Universal character set 4 bytes; UCS-4; ISO-10646
UTF-7 Universal transformation format 7 bits; UTF-7
UTF-8 Universal transformation format 8 bits; UTF-8
CORK Cork encoding; T1
GBK Simplified Chinese National Standard; GB2312
BIG5 Traditional Chinese Industrial Standard; Big5
HZ HZ encoded GB2312
unknown Unrecognized encoding

where unknown is not any real encoding, it's reported when Enca is not able to give a
reliable answer.

Surfaces
Enca has some experimental support for so-called surfaces (see below). It detects
following surfaces (not all can be applied to all charsets):

/CR CR line terminators
/LF LF line terminators
/CRLF CRLF line terminators
N.A. Mixed line terminators
N.A. Surrounded by/intermixed with non-text data
/21 Byte order reversed in pairs (1,2 -> 2,1)
/4321 Byte order reversed in quadruples (1,2,3,4 -> 4,3,2,1)
N.A. Both little and big endian chunks, concatenated
/qp Quoted-printable encoded

Note some surfaces have N.A. in place of identifier--they cannot be specified on command
line, they can only be reported by Enca. This is intentional because they only inform you
why the file cannot be considered surface-consistent instead of representing a real
surface.

Each charset has its natural surface (called `implied' in recode) which is not reported,
e.g., for IBM 852 charset it's `CRLF line terminators'. For UCS encodings, big endian is
considered as natural surface; unusual byte orders are constructed from 21 and 4321
permutations: 2143 is reported simply as 21, while 3412 is reported as combination of 4321
and 21.

Doubly-encoded UTF-8 is neither charset nor surface, it's just reported.

About charsets, encodings and surfaces
Charset is a set of character entities while encoding is its representation in the terms
of bytes and bits. In Enca, the word encoding means the same as `representation of text',
i.e. the relation between sequence of character entities constituting the text and
sequence of bytes (bits) constituting the file.

So, encoding is both character set and so-called surface (line terminators, byte order,
combining, Base64 transformation, etc.). Nevertheless, it proves convenient to work with
some {charset,surface} pairs as with genuine charsets. So, as in recode(1), all UCS- and
UTF- encodings of Universal character set are called charsets. Please see recode
documentation for more details of this issue.

The only good thing about surfaces is: when you don't start playing with them, neither
Enca won't start and it will try to behave as much as possible as a surface-unaware
program, even when talking to recode.

LANGUAGES


Enca needs to know the language of input files to work reliably, at least in case of
regular 8bit encoding. Multibyte encodings should be recognised for any Latin, Cyrillic
or Greek language.

You can (or have to) use -L option to tell Enca the language. Since people most often
work with files in the same language for which they have configured locales, Enca tries
tries to guess the language by examining value of LC_CTYPE and other locale categories
(please see locale(7)) and using it for the language when you don't specify any. Of
course, it may be completely wrong and will give you nonsense answers and damage your
files, so please don't forget to use the -L option. You can also use ENCAOPT environment
variable to set a default language (see section ENVIRONMENT).

Following languages are supported by Enca (each language is listed together with supported
8bit encodings).

Belarusian CP1251 IBM866 ISO-8859-5 KOI8-UNI maccyr IBM855
Bulgarian CP1251 ISO-8859-5 IBM855 maccyr ECMA-113
Czech ISO-8859-2 CP1250 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
Estonian ISO-8859-4 CP1257 IBM775 ISO-8859-13 macce baltic
Croatian CP1250 ISO-8859-2 IBM852 macce CORK
Hungarian ISO-8859-2 CP1250 IBM852 macce CORK
Lithuanian CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
Latvian CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
Polish ISO-8859-2 CP1250 IBM852 macce ISO-8859-13 ISO-8859-16 baltic CORK
Russian KOI8-R CP1251 ISO-8859-5 IBM866 maccyr
Slovak CP1250 ISO-8859-2 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
Slovene ISO-8859-2 CP1250 IBM852 macce CORK
Ukrainian CP1251 IBM855 ISO-8859-5 CP1125 KOI8-U maccyr
Chinese GBK BIG5 HZ
none

The special language none can be shortened to __, it contains no 8bit encodings, so only
multibyte encodings are detected.

You can also use locale names instead of languages:

Belarusian be
Bulgarian bg
Czech cs
Estonian et
Croatian hr
Hungarian hu
Lithuanian lt
Latvian lv
Polish pl
Russian ru
Slovak sk
Slovene sl
Ukrainian uk
Chinese zh

FEATURES


Several Enca's features depend on what is available on your system and how it was
compiled. You can get their list with

enca --version

Plus sign before a feature name means it's available, minus sign means this build lacks
the particular feature.

librecode-interface. Enca has interface to GNU recode library charset conversion
functions.

iconv-interface. Enca has interface to UNIX98 iconv charset conversion functions.

external-converter. Enca can use external conversion programs (if you have some suitable
installed).

language-detection. Enca tries to guess language (-L) from locales. You don't need the
--language option, at least in principle.

locale-alias. Enca is able to decrypt locale aliases used for language names.

target-charset-auto. Enca tries to detect your preferred charset from locales. Option
--auto-convert and calling Enca as enconv works, at least in principle.

ENCAOPT. Enca is able to correctly parse this environment variable before command line
parameters. Simple stuff like ENCAOPT="-L uk" will work even without this feature.

ENVIRONMENT


The variable ENCAOPT can hold set of default Enca options. Its content is interpreted
before command line arguments. Unfortunately, this doesn't work everywhere (must have
+ENCAOPT feature).

LC_CTYPE, LC_COLLATE, LC_MESSAGES (possibly inherited from LC_ALL or LANG) is used for
guessing your language (must have +language-detection feature).

The variable DEFAULT_CHARSET can be used by enconv as the default target charset.

DIAGNOSTICS


Enca returns exit code 0 when all input files were successfully proceeded (i.e. all
encodings were detected and all files were converted to required encoding, if conversion
was asked for). Exit code 1 is returned when Enca wasn't able to either guess encoding or
perform conversion on any input file because it's not clever enough. Exit code 2 is
returned in case of serious (e.g. I/O) troubles.

SECURITY


It should be possible to let Enca work unattended, it's its goal. However:

There's no warranty the detection works 100%. Don't bet on it, you can easily lose
valuable data.

Don't use enca (the program), link to libenca instead if you want anything resembling
security. You have to perform the eventual conversion yourself then.

Don't use external converters. Ideally, disable them compile-time.

Be aware of ENCAOPT and all the built-in automagic guessing various things from
environment, namely locales.

Use enconv online using onworks.net services



Latest Linux & Windows online programs