🌏 CharsetDetect: Guess character encoding
CharsetDetect is a thin wrapper around the chardetng crate.
Usage
Guess the encoding of a string:
iex> File.read!("test/assets/windows1252.txt") |> CharsetDetect.guess
{:ok, "windows-1252"}
iex> File.read!("test/assets/sjis.txt") |> CharsetDetect.guess
{:ok, "Shift_JIS"}
iex> File.read!("test/assets/utf8.txt") |> CharsetDetect.guess
{:ok, "UTF-8"}
iex> File.read!("test/assets/big5.txt") |> CharsetDetect.guess!
"Big5"
For long inputs, you can limit additional memory consumption by passing only a prefix of the text.
iex> "... (long text) ..." |> String.slice(0, 1024) |> CharsetDetect.guess
If you pass a non-string value, the function returns an error tuple.
iex> 42 |> CharsetDetect.guess
{:error, "invalid argument"}
[!NOTE] An ASCII string, including an empty string, will result in a UTF-8 encoding rather than ASCII.
iex> "hello world" |> CharsetDetect.guess
{:ok, "UTF-8"}
iex> "" |> CharsetDetect.guess
{:ok, "UTF-8"}
Security considerations
Starting from chardetng 1.0, the detector takes two options that control
which encodings can be returned: Iso2022JpDetection and Utf8Detection.
This library calls chardetng with both set to Allow, which matches the
behavior of the 0.1.x series and keeps detection of legacy Japanese content
(e.g. ISO-2022-JP email) working out of the box.
This is the permissive profile. chardetng's upstream documentation notes
that Deny is appropriate in some browser-like or web-content contexts; the
exact conditions differ between the two options, so please consult the
upstream docs for the precise trade-offs.
These options are not configurable through this library at present — the
detector is always invoked with both set to Allow.
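Because the options cannot be changed, a caller that does not want to accept certain guesses can filter the result in application code. The following is a minimal sketch under stated assumptions: GuessFilter and its deny-list are hypothetical and not part of this library, and it assumes the detector reports that encoding under the name "ISO-2022-JP". It is not equivalent to chardetng's Deny setting, since the detector would otherwise settle on a different encoding; this only rejects the guess after the fact.

```elixir
defmodule GuessFilter do
  # Hypothetical helper, not part of CharsetDetect.
  # The deny-list is an assumption chosen for illustration.
  @denied ["ISO-2022-JP"]

  @spec filter({:ok, String.t()} | {:error, String.t()}) ::
          {:ok, String.t()} | {:error, String.t()}
  # Turn a successful guess into an error when the encoding is denied.
  def filter({:ok, encoding}) when encoding in @denied do
    {:error, "encoding not accepted: " <> encoding}
  end

  # Any other result, success or error, passes through unchanged.
  def filter(result), do: result
end
```

It could then be chained after the guess, for example: text |> CharsetDetect.guess() |> GuessFilter.filter().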
Strategies for implementing a conversion function
You can achieve conversion to any desired encoding using iconv.
defmodule Converter do
  @spec convert(binary, String.t()) :: {:ok, binary} | {:error, String.t()}
  # Function head: declares the default once for all clauses.
  def convert(text, to_encoding \\ "UTF-8")

  def convert(text, to_encoding) when is_binary(text) do
    # Guess from a prefix only, to limit additional memory consumption.
    case text |> String.slice(0, 1024) |> CharsetDetect.guess do
      # Already in the target encoding: return the text unchanged.
      {:ok, ^to_encoding} ->
        {:ok, text}

      {:ok, encoding} ->
        try do
          {:ok, :iconv.convert(encoding, to_encoding, text)}
        rescue
          e in ArgumentError -> {:error, inspect(e)}
        end

      {:error, reason} ->
        {:error, reason}
    end
  end

  def convert(_, _), do: {:error, "not a string"}
end
iex> File.read!("test/assets/big5.txt") |> Converter.convert
{:ok, "大五碼是繁体中文(正體中文)社群最常用的電腦漢字字符集標準。\n"} # UTF-8
Installation
The package can be installed by adding charset_detect to your list of dependencies in mix.exs:
def deps do
  [
    {:charset_detect, "~> 0.2.0"}
  ]
end
Then, run mix deps.get.
Development
Prerequisites
[!NOTE] This library requires the Rust Toolchain for compilation.
Follow the instructions at www.rust-lang.org/tools/install to install Rust.
Verify the installation by checking the cargo command version:
cargo --version
# Should output something like: cargo 1.82.0 (8f40fc59f 2024-08-21)
Then, set the RUSTLER_PRECOMPILATION_EXAMPLE_BUILD environment variable to ensure that local sources are compiled instead of downloading a precompiled library file.
RUSTLER_PRECOMPILATION_EXAMPLE_BUILD=1 mix compile
License
The MIT License