pyfcstm.utils.decode

Automatic text decoding utilities with a focus on Chinese encodings.

This module provides helpers for decoding byte sequences by trying a series of likely encodings. It is designed to work well with Windows-centric Chinese encodings while still supporting Unicode variants. The decoding strategy attempts multiple encodings in a defined order and returns the first successful result.

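The module contains the following public components:

  • windows_chinese_encodings – the list of candidate encodings
  • auto_decode – decode bytes by trying multiple encodings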

Note

This module relies on chardet for probabilistic encoding detection.

Example:

>>> from pyfcstm.utils.decode import auto_decode
>>> text_bytes = b'\xc4\xe3\xba\xc3'  # "你好" in GBK encoding
>>> auto_decode(text_bytes)
'你好'

windows_chinese_encodings

pyfcstm.utils.decode.windows_chinese_encodings = ['utf-8', 'gbk', 'gb2312', 'gb18030', 'big5', 'cp936', 'cp950', 'hz', 'euc-cn', 'utf-16', 'utf-16-le', 'utf-16-be', 'utf-32', 'utf-32-le', 'utf-32-be']

Ordered list of candidate encodings that auto_decode tries. UTF-8 comes first, followed by the Simplified and Traditional Chinese encodings, then the UTF-16 and UTF-32 variants.
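Because several encodings can accept the same bytes, the position of an entry in the list determines which decoding wins. The following is a minimal sketch of that effect; decodable_with is a hypothetical helper written for this illustration and is not part of pyfcstm, and the candidate list is a shortened copy of the documented value.

```python
from typing import Iterable, List


def decodable_with(data: bytes, encodings: Iterable[str]) -> List[str]:
    """Return the encodings, in input order, that decode `data` without error.

    Hypothetical helper for illustration only; not part of pyfcstm.
    """
    accepted = []
    for name in encodings:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            # This encoding rejects the bytes; move on to the next one.
            continue
        accepted.append(name)
    return accepted


sample = b'\xc4\xe3\xba\xc3'  # "你好" in GBK encoding
# Shortened copy of the head of windows_chinese_encodings.
candidates = ['utf-8', 'gbk', 'gb2312', 'gb18030', 'big5']
accepted = decodable_with(sample, candidates)
print(accepted)  # the first entry is the one a first-success strategy would use
```

Here UTF-8 rejects the sample, so the first accepting entry is gbk, even though later entries would also decode the same bytes.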

auto_decode

pyfcstm.utils.decode.auto_decode(data: bytes | bytearray) -> str

Automatically decode bytes by trying multiple encodings.

The decoding order depends on the input length:

  • For inputs with length >= 30 bytes, the order is:
      1) encoding detected by chardet
      2) entries in windows_chinese_encodings
      3) system default encoding

  • For shorter inputs, the order is:
      1) entries in windows_chinese_encodings
      2) system default encoding
      3) encoding detected by chardet

The function tries each encoding until one succeeds. If all attempts fail, it raises the UnicodeDecodeError that progressed furthest (i.e., the error with the highest start position).
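The strategy described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: decode_with_fallback and its detected parameter are hypothetical names, the detector's guess is passed in explicitly so the sketch runs without chardet (the documented function obtains it from chardet itself), and ties between errors are resolved in favor of the earliest one.

```python
import sys
from typing import List, Optional, Union

# Copied from the documented value of windows_chinese_encodings.
ENCODINGS = ['utf-8', 'gbk', 'gb2312', 'gb18030', 'big5', 'cp936', 'cp950',
             'hz', 'euc-cn', 'utf-16', 'utf-16-le', 'utf-16-be',
             'utf-32', 'utf-32-le', 'utf-32-be']


def decode_with_fallback(data: Union[bytes, bytearray],
                         detected: Optional[str] = None) -> str:
    """Sketch of the documented order; `detected` stands in for chardet's guess."""
    raw = bytes(data)
    fixed: List[Optional[str]] = ENCODINGS + [sys.getdefaultencoding()]
    if len(raw) >= 30:
        # Long input: trust the detector first.
        candidates = [detected] + fixed
    else:
        # Short input: detection is less reliable, so try it last.
        candidates = fixed + [detected]

    best_error: Optional[UnicodeDecodeError] = None
    for name in candidates:
        if not name:
            continue  # no detector guess available
        try:
            return raw.decode(name)
        except LookupError:
            continue  # detector returned a name Python does not know
        except UnicodeDecodeError as err:
            # Keep the error that got furthest into the data
            # (assumption: on a tie, the earlier error is kept).
            if best_error is None or err.start > best_error.start:
                best_error = err
    assert best_error is not None  # utf-8 is always among the candidates
    raise best_error


print(decode_with_fallback(b'\xc4\xe3\xba\xc3'))
print(decode_with_fallback(b'\xe9' * 30, detected='latin-1'))
```

The second call shows why the order depends on length: at 30 bytes the detector's guess is tried first, whereas the same bytes in a shorter input would be claimed by an earlier list entry such as gbk.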

Parameters:

data (Union[bytes, bytearray]) – The byte sequence to decode.

Returns:

The decoded string.

Return type:

str

Raises:

UnicodeDecodeError – If decoding fails for all attempted encodings.

Example:

>>> text_bytes = b'\xc4\xe3\xba\xc3'  # "你好" in GBK encoding
>>> auto_decode(text_bytes)
'你好'