pyfcstm.utils.decode

Utilities for automatic text decoding with a focus on Chinese encodings.

This module detects and decodes text data automatically, with particular attention to Chinese character encodings. It includes a list of Chinese encodings commonly used on Windows systems and an auto-detection routine that tries candidate encodings in turn until one decodes the data successfully.

auto_decode

pyfcstm.utils.decode.auto_decode(data: bytes | bytearray) → str

Automatically decode bytes data by trying multiple encodings.

This function attempts to decode the input data using multiple encodings in the following order:

1. The encoding detected by chardet
2. Common Chinese encodings used in Windows
3. The default system encoding

The function tries each encoding until successful decoding is achieved. If all encodings fail, it raises the UnicodeDecodeError from the encoding that managed to decode the most characters before failing.
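The fallback order described above can be sketched roughly as follows. This is an illustrative approximation, not the module's actual implementation: the name `sketch_auto_decode`, the `detected` argument (standing in for the chardet result so that the sketch runs without the third-party dependency), and the contents of `ASSUMED_CHINESE_ENCODINGS` are all assumptions. The "most characters decoded" rule is approximated here by the byte offset at which each failure occurred.

```python
import sys
from typing import List, Optional, Union

# Hypothetical candidate list; the real module's list may differ.
ASSUMED_CHINESE_ENCODINGS: List[str] = ["gbk", "gb18030", "big5"]


def sketch_auto_decode(
    data: Union[bytes, bytearray],
    detected: Optional[str] = None,
) -> str:
    """Try candidate encodings in order and return the first successful decode.

    `detected` stands in for an encoding name reported by a detector
    such as chardet; it is an assumption made for this sketch.
    """
    raw = bytes(data)

    # Build the candidate order: detected encoding, Chinese encodings,
    # then the default system encoding.
    candidates: List[str] = []
    if detected:
        candidates.append(detected)
    candidates.extend(ASSUMED_CHINESE_ENCODINGS)
    candidates.append(sys.getdefaultencoding())

    best_error: Optional[UnicodeDecodeError] = None
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except LookupError:
            # Unknown codec name (for example, from a detector); skip it.
            continue
        except UnicodeDecodeError as err:
            # Keep the error that got furthest into the data before failing.
            if best_error is None or err.start > best_error.start:
                best_error = err

    # Every candidate failed; at least one known codec was tried,
    # so best_error is set here.
    assert best_error is not None
    raise best_error
```

Calling `sketch_auto_decode(b'\xc4\xe3\xba\xc3')` falls through to the GBK candidate and returns `'你好'`; input that no candidate can decode, such as `b'\xff\xff'`, raises `UnicodeDecodeError`.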

Parameters: data (Union[bytes, bytearray]) – The bytes data to decode
Returns: The decoded string
Return type: str
Raises: UnicodeDecodeError – If the data cannot be decoded with any of the attempted encodings

Example:

>>> text_bytes = b'\xc4\xe3\xba\xc3'  # "你好" in GBK encoding
>>> auto_decode(text_bytes)
'你好'