pyfcstm.utils.decode

Automatic text decoding utilities with a focus on Chinese encodings.

This module provides helpers for decoding byte sequences by trying a series of likely encodings. It is designed to work well with Windows-centric Chinese encodings while still supporting Unicode variants. The decoding strategy attempts multiple encodings in a defined order and returns the first successful result.

The module contains the following public components:

windows_chinese_encodings - Ordered list of common Chinese encodings
auto_decode() - Robust decoding function with auto-detection

Note

This module relies on chardet for probabilistic encoding detection.

Example:

>>> from pyfcstm.utils.decode import auto_decode
>>> text_bytes = b'\xc4\xe3\xba\xc3'  # "你好" in GBK encoding
>>> auto_decode(text_bytes)
'你好'

windows_chinese_encodings

pyfcstm.utils.decode.windows_chinese_encodings = ['utf-8', 'gbk', 'gb2312', 'gb18030', 'big5', 'cp936', 'cp950', 'hz', 'euc-cn', 'utf-16', 'utf-16-le', 'utf-16-be', 'utf-32', 'utf-32-le', 'utf-32-be']

Ordered list of encoding names that auto_decode() tries in sequence. UTF-8 comes first, followed by the GB-family and Big5 codecs, then the UTF-16 and UTF-32 variants.
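Because the first encoding that decodes without error wins, the order of this list matters. The sketch below, which uses only the standard library, shows that the same four bytes decode successfully under several entries. The names `SAMPLE`, `ENCODINGS`, and `successful_decodings` are illustrative and not part of the module.

```python
# Illustrative only: list every documented encoding that accepts the sample bytes.
SAMPLE = b'\xc4\xe3\xba\xc3'  # "你好" encoded as GBK

# Copied from the documented value of windows_chinese_encodings.
ENCODINGS = [
    'utf-8', 'gbk', 'gb2312', 'gb18030', 'big5', 'cp936', 'cp950', 'hz',
    'euc-cn', 'utf-16', 'utf-16-le', 'utf-16-be',
    'utf-32', 'utf-32-le', 'utf-32-be',
]


def successful_decodings(data, encodings):
    """Return {encoding: text} for each encoding that decodes without error."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            pass  # this encoding rejects the bytes
    return results


results = successful_decodings(SAMPLE, ENCODINGS)
first = next(iter(results))  # dicts preserve insertion order
print(list(results))         # names of every encoding that accepted the bytes
```

Several encodings accept the bytes, but only some yield the intended text, so putting the likeliest encodings first is what makes a first-success strategy usable.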

auto_decode

pyfcstm.utils.decode.auto_decode(data: bytes | bytearray) → str

Automatically decode bytes by trying multiple encodings.

The decoding order depends on the input length:

For inputs with length >= 30, the order is:
1) encoding detected by chardet
2) entries in windows_chinese_encodings
3) system default encoding

For shorter inputs, the order is:
1) entries in windows_chinese_encodings
2) system default encoding
3) encoding detected by chardet

The function tries each encoding until one succeeds. If all attempts fail, it raises the UnicodeDecodeError that progressed furthest (i.e., the error with the highest start position).
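The strategy described above can be sketched roughly as follows. This is a hypothetical re-creation for illustration, not the module's actual source. It assumes that "system default encoding" means `sys.getdefaultencoding()`, and it takes the detector as an injectable `detect` callable so the sketch runs without chardet installed. The names `sketch_auto_decode` and `ENCODINGS` are made up here.

```python
import sys
from typing import Callable, Optional, Union

# Copied from the documented value of windows_chinese_encodings.
ENCODINGS = [
    'utf-8', 'gbk', 'gb2312', 'gb18030', 'big5', 'cp936', 'cp950', 'hz',
    'euc-cn', 'utf-16', 'utf-16-le', 'utf-16-be',
    'utf-32', 'utf-32-le', 'utf-32-be',
]


def sketch_auto_decode(
    data: Union[bytes, bytearray],
    detect: Optional[Callable[[bytes], Optional[str]]] = None,
) -> str:
    """Hypothetical re-creation of the documented strategy, not the real code."""
    raw = bytes(data)
    # Assumption: `detect` stands in for chardet and returns a name or None.
    detected = detect(raw) if detect is not None else None
    # Assumption: "system default encoding" means sys.getdefaultencoding().
    default = sys.getdefaultencoding()

    if len(raw) >= 30:
        candidates = [detected, *ENCODINGS, default]
    else:
        candidates = [*ENCODINGS, default, detected]

    furthest: Optional[UnicodeDecodeError] = None
    for name in candidates:
        if not name:
            continue  # no detection result available
        try:
            return raw.decode(name)
        except LookupError:
            continue  # codec name unknown to this Python build
        except UnicodeDecodeError as err:
            # Keep the error that got furthest into the input.
            if furthest is None or err.start > furthest.start:
                furthest = err

    if furthest is None:
        raise ValueError('no usable encoding candidates')
    raise furthest
```

With chardet installed, a detector such as `lambda b: chardet.detect(b)['encoding']` could be passed as `detect`. Putting detection last for short inputs fits the idea that statistical detection has little evidence to work with on only a few bytes, though the source does not state the rationale.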

Parameters: data (Union[bytes, bytearray]) – The bytes data to decode.
Returns: The decoded string.
Return type: str
Raises: UnicodeDecodeError – If decoding fails for all attempted encodings.

Example:

>>> text_bytes = b'\xc4\xe3\xba\xc3'  # "你好" in GBK encoding
>>> auto_decode(text_bytes)
'你好'
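Callers that must never fail can catch the documented UnicodeDecodeError and fall back to a lossy decode. The following is a minimal sketch under stated assumptions: `decode_or_replace` and `strict_ascii` are hypothetical helpers, and the decoder is passed in as a parameter so the example runs without pyfcstm installed.

```python
from typing import Callable


def decode_or_replace(data: bytes, decoder: Callable[[bytes], str]) -> str:
    """Try `decoder`; on failure, decode as UTF-8 with replacement characters."""
    try:
        return decoder(data)
    except UnicodeDecodeError:
        # Lossy fallback: undecodable bytes become U+FFFD.
        return data.decode('utf-8', errors='replace')


def strict_ascii(data: bytes) -> str:
    """Stand-in decoder used only for this demonstration."""
    return data.decode('ascii')


clean = decode_or_replace(b'abc', strict_ascii)
lossy = decode_or_replace(b'ab\xff', strict_ascii)
```

In real use, `auto_decode` would be passed as the `decoder` argument in place of `strict_ascii`.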