File python-chardet.changes of Package python-chardet
-------------------------------------------------------------------
Fri Mar 6 07:41:56 UTC 2026 - Matej Cepl <mcepl@cepl.eu>
- update to 6.0.0 (the last version before the infringement;
DON’T UPGRADE UNTIL gh#chardet/chardet#327 IS RESOLVED):
- Features
- Unified single-byte charset detection: Instead of only
having trained language models for a handful of languages
(Bulgarian, Greek, Hebrew, Hungarian, Russian, Thai,
Turkish) and relying on special-case Latin1Prober and
MacRomanProber heuristics for Western encodings, chardet
now treats all single-byte charsets the same way: every
encoding gets proper language-specific bigram models
trained on CulturaX corpus data. This means chardet can now
accurately detect both the encoding and the language for
all supported single-byte encodings.
- 38 new languages: Arabic, Belarusian, Breton, Croatian,
Czech, Danish, Dutch, English, Esperanto, Estonian, Farsi,
Finnish, French, German, Icelandic, Indonesian, Irish,
Italian, Kazakh, Latvian, Lithuanian, Macedonian, Malay,
Maltese, Norwegian, Polish, Portuguese, Romanian, Scottish
Gaelic, Serbian, Slovak, Slovene, Spanish, Swedish, Tajik,
Ukrainian, Vietnamese, and Welsh. Existing models for
Bulgarian, Greek, Hebrew, Hungarian, Russian, Thai, and
Turkish were also retrained with the new pipeline.
- EncodingEra filtering: The new encoding_era parameter to
detect() takes an EncodingEra flag enum (MODERN_WEB,
LEGACY_ISO, LEGACY_MAC, LEGACY_REGIONAL, DOS, MAINFRAME,
ALL) that lets callers restrict detection to encodings from
a specific era. detect() and detect_all() default to
MODERN_WEB. The new MODERN_WEB default should drastically
improve accuracy for users who are not working with legacy
data. The tiers are:
MODERN_WEB: UTF-8/16/32, Windows-125x, CP874, CJK
multi-byte (widely used on the web)
LEGACY_ISO: ISO-8859-x, KOI8-R/U (legacy but well-known
standards)
LEGACY_MAC: Mac-specific encodings (MacRoman,
MacCyrillic, etc.)
LEGACY_REGIONAL: Uncommon regional/national encodings
(KOI8-T, KZ1048, CP1006, etc.)
DOS: DOS/OEM code pages (CP437, CP850, CP866, etc.)
MAINFRAME: EBCDIC variants (CP037, CP500, etc.)
- --encoding-era CLI flag: The chardetect CLI now accepts
-e/--encoding-era to control which encoding eras are
considered during detection.
- max_bytes and chunk_size parameters: detect(),
detect_all(), and UniversalDetector now accept max_bytes
(default 200KB) and chunk_size (default 64KB) parameters
for controlling how much data is examined. (#314, @bysiber)
- Encoding era preference tie-breaking: When multiple
encodings have very close confidence scores, the detector
now prefers more modern/Unicode encodings over legacy ones.
- Charset metadata registry: New chardet.metadata.charsets
module provides structured metadata about all supported
encodings, including their era classification and language
filter.
- should_rename_legacy now defaults intelligently: When set
to None (the new default), legacy renaming is automatically
enabled when encoding_era is MODERN_WEB.
- Direct GB18030 support: Replaced the redundant GB2312
prober with a proper GB18030 prober.
- EBCDIC detection: Added CP037 and CP500 EBCDIC model
registrations for mainframe encoding detection.
- Binary file detection: Added basic binary file detection to
abort analysis earlier on non-text files.
- Python 3.12, 3.13, and 3.14 support (#283, @hugovk; #311)
- GitHub Codespace support (#312, @oxygen-dioxide)
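The era filtering described in the Features list can be pictured with a small stand-alone sketch. It does not import chardet: the EncodingEra class below is a local stand-in that reuses only the member names listed above, and the ERA_OF table and candidates() helper are hypothetical names invented for illustration, not chardet's registry or API.

```python
from enum import Flag, auto


class EncodingEra(Flag):
    """Local stand-in for the flag enum named in the changelog (not chardet's class)."""
    MODERN_WEB = auto()
    LEGACY_ISO = auto()
    LEGACY_MAC = auto()
    LEGACY_REGIONAL = auto()
    DOS = auto()
    MAINFRAME = auto()
    ALL = MODERN_WEB | LEGACY_ISO | LEGACY_MAC | LEGACY_REGIONAL | DOS | MAINFRAME


# Hypothetical mini-registry: a few encodings taken from the tier list above.
ERA_OF = {
    "UTF-8": EncodingEra.MODERN_WEB,
    "Windows-1252": EncodingEra.MODERN_WEB,
    "ISO-8859-1": EncodingEra.LEGACY_ISO,
    "KOI8-R": EncodingEra.LEGACY_ISO,
    "MacRoman": EncodingEra.LEGACY_MAC,
    "KOI8-T": EncodingEra.LEGACY_REGIONAL,
    "CP437": EncodingEra.DOS,
    "CP037": EncodingEra.MAINFRAME,
}


def candidates(era=EncodingEra.MODERN_WEB):
    """Return the encodings whose era overlaps the requested flag combination."""
    return [name for name, enc_era in ERA_OF.items() if enc_era & era]


if __name__ == "__main__":
    print(candidates())                                          # default era only
    print(candidates(EncodingEra.DOS | EncodingEra.MAINFRAME))   # eras can be OR-ed
```

Because the eras are flags, a caller can combine several with `|` or pass ALL to restore the older behaviour of considering every encoding.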
- Fixes
- Fix CP949 state machine: Corrected the state machine for
Korean CP949 encoding detection. (#268, @nenw)
- Fix SJIS distribution analysis: Fixed
SJISDistributionAnalysis discarding valid second-byte range
>= 0x80. (#315, @bysiber)
- Fix UTF-16/32 detection for non-ASCII-heavy text: Improved
detection of UTF-16/32 encoded CJK and other non-ASCII text
by adding a MIN_RATIO threshold alongside the existing
EXPECTED_RATIO.
- Fix get_charset crash: Resolved a crash when looking up
unknown charset names.
- Fix GB18030 char_len_table: Corrected the character length
table for GB18030 multi-byte sequences.
- Fix UTF-8 state machine: Updated to be more spec-compliant.
- Fix detect_all() returning inactive probers: Results from
probers that determined "definitely not this encoding" are
now excluded.
- Fix early cutoff bug: Resolved an issue where detection
could terminate prematurely.
- Default UTF-8 fallback: If UTF-8 has not been ruled out and
nothing else is above the minimum threshold, UTF-8 is now
returned as the default.
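The default UTF-8 fallback in the Fixes list amounts to a simple decision rule, sketched below under stated assumptions. The function name choose_encoding, the (encoding, confidence) tuple shape, and the 0.20 threshold are all hypothetical; chardet's actual threshold and internals are not given in this changelog.

```python
def choose_encoding(results, utf8_ruled_out, minimum_threshold=0.20):
    """Pick the best candidate, falling back to UTF-8 when nothing is convincing.

    results: list of (encoding_name, confidence) pairs (hypothetical shape).
    utf8_ruled_out: True if the input contained bytes that are invalid UTF-8.
    minimum_threshold: illustrative value only, not chardet's real constant.
    """
    viable = [(name, conf) for name, conf in results if conf > minimum_threshold]
    if viable:
        # Highest confidence wins among candidates above the threshold.
        return max(viable, key=lambda pair: pair[1])[0]
    if not utf8_ruled_out:
        # Nothing cleared the threshold and UTF-8 is still possible.
        return "utf-8"
    return None


if __name__ == "__main__":
    print(choose_encoding([("windows-1252", 0.05)], utf8_ruled_out=False))
```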
- Breaking changes
- Dropped Python 3.7, 3.8, and 3.9 support: Now requires
Python 3.10+. (#283, @hugovk)
- Removed Latin1Prober and MacRomanProber: These special-case
probers have been replaced by the unified model-based
approach described above. Latin-1, MacRoman, and all other
single-byte encodings are now detected by
SingleByteCharSetProber with trained language models,
giving better accuracy and language identification.
- Removed EUC-TW support: EUC-TW encoding detection has been
removed as it is extremely rare in practice.
- LanguageFilter.NONE removed: Use specific language filters
or LanguageFilter.ALL instead.
- Enum types changed: InputState, ProbingState, MachineState,
SequenceLikelihood, and CharacterCategory are now IntEnum
(previously plain classes or Enum). LanguageFilter values
changed from hardcoded hex to auto().
- detect() default behavior change: detect() now defaults to
encoding_era=EncodingEra.MODERN_WEB and
should_rename_legacy=None (auto-enabled for MODERN_WEB),
whereas previously it defaulted to considering all
encodings with no legacy renaming.
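The interaction between the two new detect() defaults can be sketched as follows. This is a minimal illustration of the rule stated above (None means "rename only when the era is MODERN_WEB"), not chardet's code: the two-member EncodingEra stand-in and the resolve_rename_legacy helper are hypothetical.

```python
from enum import Flag, auto


class EncodingEra(Flag):
    """Reduced local stand-in; the real enum has more members."""
    MODERN_WEB = auto()
    LEGACY_ISO = auto()


def resolve_rename_legacy(should_rename_legacy, encoding_era):
    """Resolve the tri-state should_rename_legacy value to a plain bool."""
    if should_rename_legacy is None:
        # New default: renaming follows the era selection.
        return encoding_era == EncodingEra.MODERN_WEB
    # An explicit True/False from the caller always wins.
    return bool(should_rename_legacy)


if __name__ == "__main__":
    print(resolve_rename_legacy(None, EncodingEra.MODERN_WEB))
    print(resolve_rename_legacy(None, EncodingEra.LEGACY_ISO))
```

Callers that depended on the pre-6.0.0 behaviour would need to pass both arguments explicitly rather than rely on the defaults.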
- Misc changes
- Switched from Poetry/setuptools to uv + hatchling: Build
system modernized with hatch-vcs for version management.
- License text updated: Updated LGPLv2.1 license text and FSF
notices to use URL instead of mailing address. (#304, #307,
@musicinmybrain)
- CulturaX-based model training: The create_language_model.py
training script was rewritten to use the CulturaX
multilingual corpus instead of Wikipedia, producing higher
quality bigram frequency models.
- Language class converted to frozen dataclass: The language
metadata class now uses @dataclass(frozen=True) with
num_training_docs and num_training_chars fields replacing
wiki_start_pages.
- Test infrastructure: Added pytest-timeout and pytest-xdist
for faster parallel test execution. Reorganized test data
directories.
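For readers unfamiliar with frozen dataclasses, the Language change in the Misc list has roughly the following shape. Only the decorator and the num_training_docs and num_training_chars field names come from this changelog; the name field and all values are assumptions made for illustration.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class Language:
    """Illustrative stand-in for the language metadata class."""
    name: str                 # assumed field, not confirmed by the changelog
    num_training_docs: int    # replaces the former wiki_start_pages
    num_training_chars: int


if __name__ == "__main__":
    czech = Language(name="Czech", num_training_docs=10, num_training_chars=1000)
    try:
        czech.num_training_docs = 11  # frozen instances reject assignment
    except FrozenInstanceError as exc:
        print(f"immutable: {exc}")
```

Frozen instances are also hashable by default, so they can be used as dictionary keys or set members.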
-------------------------------------------------------------------
Mon Sep 4 16:04:35 UTC 2023 - Dirk Müller <dmueller@suse.com>
- update to 5.2.0:
* Adds support for running chardet CLI via `python -m chardet`
-------------------------------------------------------------------
Fri Apr 21 12:23:15 UTC 2023 - Dirk Müller <dmueller@suse.com>
- add sle15_python_module_pythons (jsc#PED-68)
-------------------------------------------------------------------
Thu Apr 13 22:40:29 UTC 2023 - Matej Cepl <mcepl@suse.com>
- Make calling of %{sle15modernpython} optional.
-------------------------------------------------------------------
Mon Jan 16 21:13:18 UTC 2023 - Dirk Müller <dmueller@suse.com>
- skip python 3.6 builds
-------------------------------------------------------------------
Mon Jan 2 18:40:26 UTC 2023 - Dirk Müller <dmueller@suse.com>
- update to 5.1.0:
* Add should_rename_legacy argument to most functions, which will rename
older encodings to their more modern equivalents (e.g., GB2312 becomes
GB18030) (#264, @dan-blanchard)
* Add capital letter sharp S and ISO-8859-15 support
* Add a prober for MacRoman encoding
* Add --minimal flag to chardetect command
* Add type annotations to the project and run mypy on CI
* Add support for Python 3.11
* Clarify LGPL version in License trove classifier (#255, @musicinmybrain)
* Remove support for EOL Python 3.6 (#260, @jdufresne)
* Remove unnecessary guards for non-falsey values (#259, @jdufresne)
* Switch to Python 3.10 release in GitHub actions (#257, @jdufresne)
* Remove setup.py in favor of build package (#262, @jdufresne)
* Run tests on macOS, Windows, and 3.11-dev (#267, @dan-blanchard)
-------------------------------------------------------------------
Tue Jul 5 13:21:09 UTC 2022 - Ben Greiner <code@bnavigator.de>
- Update to 5.0.0
* This release is the first release of chardet that no longer
supports Python < 3.6
* Added a prober for Johab Korean (#207, @grizlupo)
* Added a prober for UTF-16/32 BE/LE (#109, #206, @jpz)
* Added test data for Croatian, Czech, Hungarian, Polish, Slovak,
Slovene, Greek, and Turkish, which should help prevent future
errors with those languages
* Improved XML tag filtering, which should improve accuracy for
XML files (#208)
* Tweaked SingleByteCharSetProber confidence to match latest
uchardet (#209)
* Made detect_all return child prober confidences (#210)
* Updated examples in docs (#223, @domdfcoding)
* Documentation fixes (#212, #224, #225, #226, #220, #221, #244,
from too many contributors to mention)
* Minor performance improvements (#252, @deedy5)
* Add support for Python 3.10 when testing (#232, @jdufresne)
* Lots of little development cycle improvements, mostly thanks to
@jdufresne
- Canonicalize alternatives creation
-------------------------------------------------------------------
Fri Dec 10 09:05:04 UTC 2021 - pgajdos@suse.com
- pytest-runner is not required for build
-------------------------------------------------------------------
Thu Sep 30 08:18:47 UTC 2021 - Stefan Schubert <schubi@suse.de>
- Use libalternatives instead of update-alternatives.
-------------------------------------------------------------------
Sun Dec 20 05:52:28 UTC 2020 - John Vandenberg <jayvdb@gmail.com>
- Remove now unnecessary pytest4.patch and python-chardet-rpmlintrc
- Update to v4.0.0
See https://github.com/chardet/chardet/compare/3.0.4...4.0.0
-------------------------------------------------------------------
Mon Oct 14 11:45:00 UTC 2019 - Matej Cepl <mcepl@suse.com>
- Replace %fdupes -s with plain %fdupes; hardlinks are better.
-------------------------------------------------------------------
Wed Jul 3 08:32:17 UTC 2019 - Tomáš Chvátal <tchvatal@suse.com>
- Add patch to fix build with pytest4:
* pytest4.patch
-------------------------------------------------------------------
Tue Feb 26 08:14:25 UTC 2019 - Tomáš Chvátal <tchvatal@suse.com>
- Switch to multibuild to avoid buildcycles
-------------------------------------------------------------------
Tue Dec 4 12:49:12 UTC 2018 - Matej Cepl <mcepl@suse.com>
- Remove superfluous devel dependency for noarch package
-------------------------------------------------------------------
Tue May 15 07:02:02 UTC 2018 - antoine.belvire@opensuse.org
- Fix update-alternatives call in %postun.
-------------------------------------------------------------------
Wed Sep 20 21:47:30 UTC 2017 - dmueller@suse.com
- add update-alternatives post-requires
-------------------------------------------------------------------
Fri Aug 25 13:09:48 UTC 2017 - tbechtold@suse.com
- Fix build for Leap-42.3
-------------------------------------------------------------------
Tue Aug 15 09:57:21 UTC 2017 - dmueller@suse.com
- add update-alternatives support for py2/py3 coinstallability
-------------------------------------------------------------------
Thu Jun 29 08:43:41 UTC 2017 - ecsos@opensuse.org
- fix source link
-------------------------------------------------------------------
Sat Jun 10 08:39:04 UTC 2017 - dmueller@suse.com
- update to 3.0.4
-------------------------------------------------------------------
Tue Mar 21 13:57:55 UTC 2017 - jmatejek@suse.com
- do not use %py_ver, replace with %python_version
-------------------------------------------------------------------
Sun Mar 19 08:23:54 UTC 2017 - aloisio@gmx.com
- Converted to single spec.
-------------------------------------------------------------------
Mon Jan 30 21:41:47 UTC 2017 - rjschwei@suse.com
- Include in SLE 12 (bsc#1002895, FATE#321630)
-------------------------------------------------------------------
Mon May 11 05:49:58 UTC 2015 - arun@gmx.de
- specfile:
* added update-alternatives to prevent conflicts with python3 version
* add tests
-------------------------------------------------------------------
Tue Feb 10 23:45:01 UTC 2015 - aloisio@gmx.com
- Update to version 2.3.0
* Added support for CP932 detection (thanks to @hashy)
* Fixed an issue where UTF-8 with a BOM would not be detected
as UTF-8-SIG (#8)
* Modified chardetect to use argparse for argument parsing
* Moved docs to a gh-pages branch. You can now access them
at http://chardet.github.io
- Changelog on https://github.com/chardet/chardet/commits/2.3.0
- Other minor changes
-------------------------------------------------------------------
Thu Oct 24 11:00:03 UTC 2013 - speilicke@suse.com
- Require python-setuptools instead of distribute (upstreams merged)
-------------------------------------------------------------------
Tue Oct 2 03:09:41 UTC 2012 - alexandre@exatati.com.br
- Update to 2.1.1:
- Sorry, no changelog.
-------------------------------------------------------------------
Fri Jul 27 15:10:36 UTC 2012 - alexandre@exatati.com.br
- Update to 1.1:
- Sorry, no changelog.
-------------------------------------------------------------------
Wed Dec 28 22:18:53 UTC 2011 - alexandre@exatati.com.br
- Standard in spec file;
- Remove CFLAGS and %clean section from spec file.
-------------------------------------------------------------------
Thu Dec 8 11:12:41 UTC 2011 - coolo@suse.com
- the license seems to be LGPL-2.1+
-------------------------------------------------------------------
Sat Mar 26 02:11:35 UTC 2011 - alexandre@exatati.com.br
- Regenerate spec file with py2pack;
- Bzip2 source file.
-------------------------------------------------------------------
Mon Jan 25 14:35:35 UTC 2010 - alexandre@exatati.com.br
- Initial package (2.0.1) for openSUSE.