pyfcstm.utils.decode

Utilities for automatic text decoding with a focus on Chinese encodings.

This module detects and decodes text data automatically, with an emphasis on Chinese character encodings. It includes a list of Chinese encodings commonly used on Windows systems and an auto-detection routine that tries multiple encodings until one of them decodes the data successfully.

auto_decode

pyfcstm.utils.decode.auto_decode(data: bytes | bytearray) → str

Automatically decode bytes data by trying multiple encodings.

This function attempts to decode the input data using multiple encodings in the following order:

  1. The encoding detected by chardet

  2. Common Chinese encodings used in Windows

  3. The default system encoding

The function tries each encoding in turn and returns the result of the first one that succeeds. If every encoding fails, it raises the UnicodeDecodeError from the encoding that decoded the most characters before failing.
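The fallback order and error selection described above can be sketched roughly as follows. This is an illustration, not the module's actual implementation. The name `auto_decode_sketch`, the `CHINESE_ENCODINGS` list, and the `detected` parameter are assumptions made here. The chardet step is stood in for by the optional `detected` argument so that the sketch needs only the standard library, and "decoded the most characters" is approximated by the byte offset at which each attempt failed.

```python
import sys
from typing import List, Optional, Union

# Assumed candidate list for illustration; the module's real list of
# Windows Chinese encodings may be longer or ordered differently.
CHINESE_ENCODINGS = ["gbk", "gb18030", "big5"]


def auto_decode_sketch(
    data: Union[bytes, bytearray],
    detected: Optional[str] = None,
) -> str:
    """Try candidate encodings in order; return the first successful decode.

    `detected` is a hypothetical stand-in for a detector's guess,
    e.g. chardet.detect(data)["encoding"], which may be None.
    """
    candidates: List[str] = []
    if detected:
        candidates.append(detected)
    candidates.extend(CHINESE_ENCODINGS)
    candidates.append(sys.getdefaultencoding())

    raw = bytes(data)
    best_error: Optional[UnicodeDecodeError] = None
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except LookupError:
            # Unknown encoding name (for example from the detector); skip it.
            continue
        except UnicodeDecodeError as err:
            # err.start is the byte offset of the first undecodable byte.
            # Keep the error from the attempt that got furthest.
            if best_error is None or err.start > best_error.start:
                best_error = err

    if best_error is None:
        raise LookupError("no usable encoding among the candidates")
    raise best_error
```

With no detector guess, `auto_decode_sketch(b'\xc4\xe3\xba\xc3')` falls through to the first Chinese candidate (GBK in this assumed list) and returns `'你好'`.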

Parameters:

data (Union[bytes, bytearray]) – The bytes data to decode

Returns:

The decoded string

Return type:

str

Raises:

UnicodeDecodeError – If the data cannot be decoded with any of the attempted encodings
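Because the function raises when every encoding fails, callers that must always produce some text may want a lossy fallback. The helper below is a hypothetical wrapper, not part of pyfcstm. It takes the decoder as an argument so the sketch stays self-contained, and the choice of UTF-8 with replacement characters as the fallback is an assumption.

```python
from typing import Callable, Union


def decode_or_replace(
    data: Union[bytes, bytearray],
    decoder: Callable[[Union[bytes, bytearray]], str],
) -> str:
    """Return decoder(data), or a lossy UTF-8 decode if the decoder raises."""
    try:
        return decoder(data)
    except UnicodeDecodeError:
        # Assumed fallback: undecodable bytes become U+FFFD.
        return bytes(data).decode("utf-8", errors="replace")


# Intended usage (requires pyfcstm to be installed):
#   from pyfcstm.utils.decode import auto_decode
#   text = decode_or_replace(raw_bytes, auto_decode)
```

Whether a lossy result is acceptable depends on the caller. For data that must round-trip exactly, letting the UnicodeDecodeError propagate is usually the safer choice.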

Example:

>>> from pyfcstm.utils.decode import auto_decode
>>> text_bytes = b'\xc4\xe3\xba\xc3'  # "你好" in GBK encoding
>>> auto_decode(text_bytes)
'你好'