pyfcstm.utils.decode
Automatic text decoding utilities with a focus on Chinese encodings.
This module provides helpers for decoding byte sequences by trying a series of likely encodings. It is designed to work well with Windows-centric Chinese encodings while still supporting Unicode variants. The decoding strategy attempts multiple encodings in a defined order and returns the first successful result.
The module contains the following public components:
- windows_chinese_encodings - Ordered list of common Chinese encodings
- auto_decode() - Robust decoding function with auto-detection
Note
This module relies on chardet for probabilistic encoding detection.
Example:
>>> from pyfcstm.utils.decode import auto_decode
>>> text_bytes = b'\xc4\xe3\xba\xc3' # "你好" in GBK encoding
>>> auto_decode(text_bytes)
'你好'
windows_chinese_encodings
- pyfcstm.utils.decode.windows_chinese_encodings = ['utf-8', 'gbk', 'gb2312', 'gb18030', 'big5', 'cp936', 'cp950', 'hz', 'euc-cn', 'utf-16', 'utf-16-le', 'utf-16-be', 'utf-32', 'utf-32-le', 'utf-32-be']
Ordered list of common Chinese encodings, followed by Unicode variants. auto_decode() tries these entries in sequence.
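Because the first successful decode wins, the order of this list matters: several legacy encodings accept the same bytes without error but produce different text. The following is a minimal, standard-library-only sketch that reports which encodings accept a given input. The helper name `accepting_encodings` is hypothetical and not part of pyfcstm, and the list is copied from the value documented above.

```python
from typing import List, Sequence

# Copied from the documented value of windows_chinese_encodings.
ENCODINGS = [
    'utf-8', 'gbk', 'gb2312', 'gb18030', 'big5', 'cp936', 'cp950', 'hz',
    'euc-cn', 'utf-16', 'utf-16-le', 'utf-16-be',
    'utf-32', 'utf-32-le', 'utf-32-be',
]


def accepting_encodings(data: bytes, encodings: Sequence[str] = ENCODINGS) -> List[str]:
    """Hypothetical helper: list the encodings that decode `data` without error."""
    accepted = []
    for name in encodings:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding rejects the input
        accepted.append(name)
    return accepted


gbk_bytes = b'\xc4\xe3\xba\xc3'  # "你好" in GBK encoding
accepted = accepting_encodings(gbk_bytes)
print(accepted)  # utf-8 rejects these bytes, so gbk is the first match
```

Any encoding that appears later in the printed list would also have "succeeded", which is why a first-match strategy depends on placing the most likely encodings first.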
auto_decode
- pyfcstm.utils.decode.auto_decode(data: bytes | bytearray) -> str
Automatically decode bytes by trying multiple encodings.
The decoding order depends on the input length:
- For inputs with length >= 30, the order is: 1) encoding detected by chardet 2) entries in windows_chinese_encodings 3) system default encoding
- For shorter inputs, the order is: 1) entries in windows_chinese_encodings 2) system default encoding 3) encoding detected by chardet
The function tries each encoding until one succeeds. If all attempts fail, it raises the UnicodeDecodeError that progressed furthest (i.e., the error with the highest start position).
- Parameters:
data (Union[bytes, bytearray]) – The bytes data to decode.
- Returns:
The decoded string.
- Return type:
str
- Raises:
UnicodeDecodeError – If decoding fails for all attempted encodings.
Example:
>>> text_bytes = b'\xc4\xe3\xba\xc3' # "你好" in GBK encoding
>>> auto_decode(text_bytes)
'你好'
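The documented strategy can be approximated with the standard library alone. This is a sketch under stated assumptions, not the pyfcstm implementation: the names `candidate_order` and `sketch_auto_decode` are hypothetical, the chardet result is passed in as the `detected` argument rather than computed, and "system default encoding" is assumed to mean `sys.getdefaultencoding()`.

```python
import sys
from typing import List, Optional, Union

# Copied from the documented value of windows_chinese_encodings.
ENCODINGS = [
    'utf-8', 'gbk', 'gb2312', 'gb18030', 'big5', 'cp936', 'cp950', 'hz',
    'euc-cn', 'utf-16', 'utf-16-le', 'utf-16-be',
    'utf-32', 'utf-32-le', 'utf-32-be',
]


def candidate_order(data: Union[bytes, bytearray], detected: Optional[str]) -> List[str]:
    """Hypothetical helper: encodings to try, in the documented order."""
    default = sys.getdefaultencoding()  # assumption: the "system default"
    if len(data) >= 30:
        raw = [detected, *ENCODINGS, default]
    else:
        raw = [*ENCODINGS, default, detected]
    # Drop a missing detection result and duplicates, keeping first occurrences.
    order: List[str] = []
    for name in raw:
        if name and name not in order:
            order.append(name)
    return order


def sketch_auto_decode(data: Union[bytes, bytearray], detected: Optional[str] = None) -> str:
    """Hypothetical sketch: return the first successful decode."""
    best_error: Optional[UnicodeDecodeError] = None
    for name in candidate_order(data, detected):
        try:
            return data.decode(name)
        except UnicodeDecodeError as err:
            # Keep the error that progressed furthest (highest start position).
            if best_error is None or err.start > best_error.start:
                best_error = err
    assert best_error is not None  # the candidate list is never empty
    raise best_error


print(sketch_auto_decode(b'\xc4\xe3\xba\xc3'))  # 你好
```

Passing the detection result as an argument keeps the sketch free of third-party dependencies; in real use that value would come from a detector such as chardet.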