File python-ocrmypdf.changes of Package python-ocrmypdf
-------------------------------------------------------------------
Wed Nov 6 14:57:33 UTC 2024 - Matej Cepl <mcepl@cepl.eu>
- Update to 16.6.0:
- Fixed an issue where damaged PDFs would fail with --redo-ocr.
:issue:`1403`
- Fixed an error that prevented JBIG2 optimization on Windows
if the image was optimized in an earlier step. :issue:`1396`
- Fixed an error detecting the version of unpaper 7.0.0.
:issue:`1409`
- Fixed a performance regression when scanning pages.
:issue:`1378`. Thanks @aliemjay.
- Fixed Alpine Docker image by enforcing Alpine 3.19. Alpine
3.20 includes a defective version of Tesseract OCR and so is
not usable.
- Upgraded Ubuntu Docker image to use Ubuntu 24.04.
- Build and test scripts/actions switched to uv.
- When running in a container, we now remind the user that
temporary folders are inside the container and may not be
accessible.
- Fixed Linux test coverage matrix, which was missing some key
versions.
- Update to 16.5.0:
- Fixed issue with interpreting PDFs that have images with
array masks. :issue:`1377`
- Enabled testing on Python 3.13.
- Fixed a test that did not work correctly but still passed.
:issue:`1382`
- Improved "PDF/A conversion failed" warning message to better
describe implications.
- Updated documentation to better explain OCR_JSON_SETTINGS in
batch processing.
- Build backend changed from setuptools to hatchling.
- Update to 16.4.3:
- Work around pdfminer.six issue where a token on the buffer
boundary is incorrectly parsed as two tokens. :issue:`1361`
- New rules are applied to stencil masks and explicit masks
when calculating the optimal page DPI for rendering.
:issue:`1362`
- Fixed attempts to use an incompatible jbig2.EXE provided by
TeX Live. :issue:`1363`
- Update to 16.4.2:
- Fixed order of filenames passed to Ghostscript for PDF/A
generation. :issue:`1359`
- Suppressed missing jbig2dec warning message. :issue:`1358`
- Fixed calculation of image size when soft mask dimensions
don't match image dimension. :issue:`1351`
- Several fixes to documentation. Thanks to users Iris and
JoKalliauer who contributed these changes.
- Fixed error on processing PDFs that are missing certain image
metadata. :issue:`1315`
- Update to 16.4.1:
- Fixed calculation of image printed area (used in finding
weighted DPI for OCR). :issue:`1334`
- Fixed "NotImplementedError: not sure how to get colorspace"
error messages in logs, which simply record a failure
to optimize images with print production colorspaces.
:issue:`1315`
- Update to 16.4.0:
- Selecting the osd and equ pseudo-languages with -l/--language
now exits with an error when using Tesseract OCR, because
these are not regular Tesseract languages but implementation
details. Using them can cause Tesseract to crash.
- The hOCR renderer is more tolerant of extra whitespace in
input files.
- watcher.py now changes the output file extension to .pdf when
the input is not .pdf.
- Improved handling of PDFs that contain circularly referenced
Form XObjects. :issue:`1321`
- Fixed Alpine Docker image for ARM64, which was not building
correctly.
- Docker images now use pikepdf 9.0.0.
- Prevent use of Tesseract OCR 5.4.0, a version with known
regressions.
- Disabled progress bar for "Linearizing" when --no-progress-bar
is set.
- Fixed some tests that warn about missing JBIG2 decoding via
pikepdf, by installing the necessary libraries during tests.
- Update to 16.3.1:
- Fixed a test suite failure with Ghostscript 10.03.0+.
:issue:`1316`
- Fixed an issue with the presentation of the "OCR" progress
bar. :issue:`1313`
- Update to 16.3.0:
- Fixed progress bar not displaying for Ghostscript PDF/A
conversion. :issue:`1313`
- Added progress bar for linearization. :issue:`1313`
- If --rotate-pages-threshold is issued without --rotate-pages, we
now exit with an error since the user likely intended to use
--rotate-pages. :issue:`1309`
- If Tesseract hOCR gives an invalid line box, print an error
message instead of exiting with an error. :issue:`1312`
- Update to 16.2.0:
- Fixed issue 'NoneType' object has no attribute 'get' when
optimizing certain PDFs. :issue:`1293,1271`
- Switched formatting from black to ruff.
- Added support for sending sidecar output to io.BytesIO.
- Added support for converting HEIF/HEIC images (the native
image of iPhones and some other devices) to PDFs, when the
appropriate pi-heif library is installed. This library is
marked as a dependency, but maintainers may opt out if
needed.
- We now default to downsampling large images that would
exceed Tesseract's internal limits, but only if they would cause
processing to fail. Previously, this behavior only occurred
if specifically requested on command line. It can still be
configured and disabled. See the --tesseract command line
options.
- Added Macports install instructions. Thanks @akierig.
- Improved logging output when an unexpected error occurs while
trying to obtain the version of a third party program.
- Update to 16.1.2:
- Fixed test suite failure when using Ghostscript 10.3.
- Other minor corrections.
- Update to 16.1.1:
- Fixed PyPy 3.10 support.
- Update to 16.1.0:
- Improved hOCR renderer is now default for left to right
languages.
- Improved handling of rotated pages. Previously, OCR text
might be missing for pages that were rotated with a /Rotate
tag on the page entry.
- Improved handling of cropped pages. Previously, in some
cases a page with a crop box would not have its OCR applied
correctly and misalignment between OCR text and visible text
could occur.
- Documentation improvements, especially installation
instructions for less common platforms.
-------------------------------------------------------------------
Mon Jan 8 15:26:44 UTC 2024 - ecsos <ecsos@opensuse.org>
- Update to 16.0.4
- Fixed some issues for left-to-right text with the new hOCR renderer.
It is not the default yet but will be made so soon.
Right-to-left text is still in progress.
- Added an error to prevent use of several versions of Ghostscript
that seem to corrupt existing text in input PDFs.
Newly generated OCR is not affected.
For best results, use Ghostscript 10.02.1 or newer,
which contains the fix for the issue.
-------------------------------------------------------------------
Thu Jan 4 10:05:05 UTC 2024 - ecsos <ecsos@opensuse.org>
- Update to 16.0.3
- Changed minimum required Ghostscript to 9.54, to support users of RHEL 9 and its derivatives,
since that is the latest version available there.
- Removed warning message about CVE-2023-43115, on the assumption that most distributions have backported the patch by now.
- Changes from 16.0.2
- Temporarily changed PDF text renderer back to sandwich by default to address regressions in macOS Preview.
- Changes from 16.0.1
- Fixed text rendering issue with new hOCR text renderer - extraneous byte order marks.
- Tightened dependencies.
- Changes from 16.0.0
- Added a new OCR text renderer that combines the best ideas of Tesseract's PDF generator and the older hOCR transformer renderer.
The result is a hopefully permanent fix for wordssmushedtogetherwithoutspaces issues in extracted text, better
registration/position of text on skewed baselines :issue:`1009`, fixes to character output when the German Fraktur script
is used :issue:`1191`, proper rendering of right to left languages (Arabic, Hebrew, Persian) :issue:`1157`.
Asian languages may still have excessive word breaks compared to expectations. The new renderer is the default;
the old sandwich renderer is still available using --pdf-renderer sandwich; the old hOCR renderer is no more.
- The ocrmypdf.hocrtransform API has changed substantially.
- Support for Python 3.9 has been dropped. Python 3.10+ is now required.
- pikepdf >= 8.8.0 is now required.
-------------------------------------------------------------------
Fri Dec 15 08:32:05 UTC 2023 - ecsos <ecsos@opensuse.org>
- Initial version 15.4.4