File OCRmyPDF.changes of Package OCRmyPDF

-------------------------------------------------------------------
Mon Nov 11 17:46:14 UTC 2024 - Frank Kunz <mailinglists@kunz-im-inter.net>

- Update to version 16.6.1
  v16.6.1
    Fixed some issues with Docker build, such as removing unnecessary content and using a stable Tesseract version.
    Reverted Docker image to Ubuntu 22.04 to access older/more stable Ghostscript for now.
    Clarified batch commands in documentation.
    Fixed an issue with JSON serialization and pickling of HOCRResult. :issue:`1427`
  v16.6.0
    Fixed an issue where damaged PDFs would fail with --redo-ocr. :issue:`1403`
    Fixed an error that prevented JBIG2 optimization on Windows if the image was optimized in an earlier step. :issue:`1396`
    Fixed an error detecting the version of unpaper 7.0.0. :issue:`1409`
    Fixed a performance regression when scanning pages. :issue:`1378`. Thanks @aliemjay.
    Fixed Alpine Docker image by enforcing Alpine 3.19. Alpine 3.20 includes a defective version of Tesseract OCR and so is not usable.
    Upgraded Ubuntu Docker image to use Ubuntu 24.04.
    Build and test scripts/actions switched to uv.
    When running in a container, we now remind the user that temporary folders are inside the container and may not be accessible.
    Fixed Linux test coverage matrix, which was missing some key versions.
  v16.5.0
    Fixed issue with interpreting PDFs that have images with array masks. :issue:`1377`
    Enabled testing on Python 3.13.
    Fixed a test that did not work correctly but still passed. :issue:`1382`
    Improved "PDF/A conversion failed" warning message to better describe implications.
    Updated documentation to better explain OCR_JSON_SETTINGS in batch processing.
    Build backend changed from setuptools to hatchling.
  v16.4.3
    Work around pdfminer.six issue where a token on the buffer boundary is incorrectly parsed as two tokens. :issue:`1361`
    New rules are applied to stencil masks and explicit masks when calculating the optimal page DPI for rendering. :issue:`1362`
    Fixed attempts to use an incompatible jbig2.EXE provided by TeX Live. :issue:`1363`
  v16.4.2
    Fixed order of filenames passed to Ghostscript for PDF/A generation. :issue:`1359`
    Suppressed missing jbig2dec warning message. :issue:`1358`
    Fixed calculation of image size when soft mask dimensions don't match image dimension. :issue:`1351`
    Several fixes to documentation. Thanks to users Iris and JoKalliauer who contributed these changes.
    Fixed error on processing PDFs that are missing certain image metadata. :issue:`1315`
  v16.4.1
    Fixed calculation of image printed area (used in finding weighted DPI for OCR). :issue:`1334`
    Fixed "NotImplementedError: not sure how to get colorspace" error messages in logs, which simply record a failure to optimize images with print production colorspaces. :issue:`1315`
  v16.4.0
    Selecting the osd and equ pseudo-languages with -l/--language now exits with an error when using Tesseract OCR, because these are not regular Tesseract languages but implementation details. Using them can cause Tesseract to crash.
    The hOCR renderer is more tolerant of extra whitespace in input files.
    watcher.py now changes the output file extension to .pdf when the input is not .pdf.
    Improved handling of PDFs that contain circularly referenced Form XObjects. :issue:`1321`
    Fixed Alpine Docker image for ARM64, which was not building correctly.
    Docker images now use pikepdf 9.0.0.
    Prevent use of Tesseract OCR 5.4.0, a version with known regressions.
    Disabled progressbar for "Linearizing" when --no-progress-bar set.
    Fixed some tests that warn about missing JBIG2 decoding via pikepdf, by installing the necessary libraries during tests.
  v16.3.1
    Fixed a test suite failure with Ghostscript 10.03.0+. :issue:`1316`
    Fixed an issue with the presentation of the "OCR" progress bar. :issue:`1313`
  v16.3.0
    Fixed progress bar not displaying for Ghostscript PDF/A conversion. :issue:`1313`
    Added progress bar for linearization. :issue:`1313`
    If --rotate-pages-threshold is issued without --rotate-pages, we now exit with an error since the user likely intended to use --rotate-pages. :issue:`1309`
    If Tesseract hOCR gives an invalid line box, print an error message instead of exiting with an error. :issue:`1312`
  v16.2.0
    Fixed issue 'NoneType' object has no attribute 'get' when optimizing certain PDFs. :issue:`1293,1271`
    Switched formatting from black to ruff.
    Added support for sending sidecar output to io.BytesIO.
    Added support for converting HEIF/HEIC images (the native image format of iPhones and some other devices) to PDFs, when the appropriate pi-heif library is installed. This library is marked as a dependency, but maintainers may opt out if needed.
    We now default to downsampling large images that would exceed Tesseract's internal limits, but only if they would otherwise cause processing to fail. Previously, this behavior only occurred if specifically requested on the command line. It can still be configured and disabled. See the --tesseract command line options.
    Added Macports install instructions. Thanks @akierig.
    Improved logging output when an unexpected error occurs while trying to obtain the version of a third party program.

-------------------------------------------------------------------
Sun Apr 14 09:42:17 UTC 2024 - Frank Kunz <mailinglists@kunz-im-inter.net>

- Update to version 16.1.2
  Remove 0001-Drop-shebang-from-non-executable-files.patch from build
  v16.1.2
    Fixed test suite failure when using Ghostscript 10.3.
    Other minor corrections.
  v16.1.1
    Fixed PyPy 3.10 support.
  v16.1.0
    Improved hOCR renderer is now default for left to right languages.
    Improved handling of rotated pages. Previously, OCR text might be missing for
    pages that were rotated with a /Rotate tag on the page entry.
    Improved handling of cropped pages. Previously, in some cases a page with a
    crop box would not have its OCR applied correctly and misalignment between
    OCR text and visible text could occur.
    Documentation improvements, especially installation instructions for less
    common platforms.
  v16.0.4
    Fixed some issues for left-to-right text with the new hOCR renderer. It is still
    not default yet but will be made so soon. Right-to-left text is still in progress.
    Added an error to prevent use of several versions of Ghostscript that seem
    to corrupt existing text in input PDFs. Newly generated OCR is not affected.
    For best results, use Ghostscript 10.02.1 or newer, which contains the fix
    for the issue.
  v16.0.3
    Changed minimum required Ghostscript to 9.54, to support users of RHEL 9 and its
    derivatives, since that is the latest version available there.
    Removed warning message about CVE-2023-43115, on the assumption that most
    distributions have backported the patch by now.
  v16.0.2
    Temporarily changed PDF text renderer back to sandwich by default to address
    regressions in macOS Preview.
  v16.0.1
    Fixed text rendering issue with new hOCR text renderer - extraneous byte order
    marks.
    Tightened dependencies.
  v16.0.0
    Added a new OCR text renderer that combines the best ideas of Tesseract's PDF
    generator and the older hOCR transformer renderer. The result is a hopefully
    permanent fix for wordssmushedtogetherwithoutspaces issues in extracted text,
    better registration/position of text on skewed baselines :issue:`1009`,
    fixes to character output when the German Fraktur script is used :issue:`1191`,
    and proper rendering of right to left languages (Arabic, Hebrew, Persian) :issue:`1157`.
    Asian languages may still have excessive word breaks compared to expectations.
    The new renderer is the default; the old sandwich renderer is still available
    using ``--pdf-renderer sandwich``; the old hOCR renderer is no more.
    The ``ocrmypdf.hocrtransform`` API has changed substantially.
    Support for Python 3.9 has been dropped. Python 3.10+ is now required.
    pikepdf >= 8.8.0 is now required.
  v15.4.4
    Fixed documentation for installing Ghostscript on Windows. :issue:`1198`
    Added warning message about security issue in older versions of Ghostscript.
  v15.4.3
    Fixed deprecation warning in pikepdf older than 8.7.1; pikepdf >= 8.7.1 is
    now required.
  v15.4.2
    We now raise an exception on a certain class of PDFs that likely need an
    explicit color conversion strategy selected to display correctly
    for PDF/A conversion.
    Fixed an error that occurred while trying to write a log message after the
    debug log handler was removed.
  v15.4.1
    Fixed misc/watcher.py regressions: accept ``--ocr-json-settings`` as either
    filename or JSON string, as previously; and argument count mismatch.
    :issue:`1183,1185`
    We no longer attempt to set /ProcSet in the PDF output, since this is an
    obsolete PDF feature.
    Documentation improvements.
  v15.4.0
    Added new experimental APIs to support offline editing of the final text.
    Specifically, one can now generate hOCR files with OCRmyPDF, edit them with
    some other tool, and then finalize the PDF. They are experimental and
    subject to change, including details of how the working folder is used.
    There is no command line interface.
    Code reorganization: executors, progress bars, initialization and setup.
    Fixed test coverage in cases where the coverage tool did not properly trace
    into threads or subprocesses. This code was still being tested but appeared
    as not covered.
    In the test suite, reduced use of subprocesses and other techniques that
    interfere with coverage measurement.
    Improved error check for when we appear to be running inside a snap container
    and files are not available.
    Plugin specification now properly defines progress bars as a protocol rather
    than defining them as "tqdm-like".
    We now default to using "forkserver" process creation on POSIX platforms
    rather than fork, since this method is more robust and avoids some
    issues when threads are present.
    Fixed an instance where the user's request to ``--no-use-threads`` was ignored.
    If a PDF does not have language metadata on its top level object, we add
    the OCR language.
    Replace some cryptic test error messages with more helpful ones.
    Debug messages for how OCRmyPDF picks the colorspace for a page are now
    more descriptive.
  v15.3.1
    Fixed an issue with logging settings for misc/watcher.py introduced in the
    previous release. :issue:`1180`
    We now attempt to preserve the input's extended attributes when creating
    the output file.
    For some reason, the macOS build now needs OpenSSL explicitly installed.
    Updated documentation on Docker performance concerns.
  v15.3.0
    Update misc/watcher.py to improve command line interface using Typer, and
    support ``.env`` specification of environment variables. Improved error
    messages. Thanks to @mflagg2814 for the PR that prompted this improvement.
    Improved error message when a file cannot be read because we are running in
    a snap container.
  v15.2.0
    Added a Docker image based on Alpine Linux. This image is smaller than the
    Ubuntu-based image and may be useful in some situations. Currently hosted at
    jbarlow83/ocrmypdf-alpine. Currently not available in ARM flavor.
    The Ubuntu Docker is now aliased to jbarlow83/ocrmypdf-ubuntu.
    Updated Docker documentation.
  v15.1.0
    We now require Pillow 10.0.1, due to a serious security vulnerability in all earlier
    versions of that dependency. The vulnerability concerns WebP images and could
    be triggered in OCRmyPDF when creating a PDF from a malicious WebP image.
    Added some keyword arguments to ``ocrmypdf.ocr`` that were previously accepted
    but undocumented.
    Documentation updates and typing improvements.

-------------------------------------------------------------------
Fri Sep 29 20:02:25 UTC 2023 - Frank Kunz <mailinglists@kunz-im-inter.net>

- Update to version 15.0.2
  Added Python 3.12 to test matrix.
  Updated documentation for notes on Python 3.12, 32-bit support and some new features in v15.

-------------------------------------------------------------------
Wed Sep 27 17:39:10 UTC 2023 - Frank Kunz <mailinglists@kunz-im-inter.net>

- Update to version 15.0.1
  v15.0.1
    Wheels Python tag changed to py39.
    Marked as an expected failure a test that fails on recent Ghostscript versions.
    Clarified documentation and release notes around the extent of 32-bit support.
    Updated installation documentation to reflect changes in v15.
  v15.0.0
    Dropped support for Python 3.8.
    Dropped support for many older dependencies - see pyproject.toml for details. Generally speaking, Ubuntu 22.04 is our baseline system.
    Dropped support for 32-bit Linux wheels. You must use a 64-bit operating system, and 64-bit versions of Python, Tesseract and Ghostscript to use OCRmyPDF. Many of our dependencies are dropping 32-bit builds (e.g. Pillow), and we are following suit. (Maintainers may still build 32-bit versions from source.)
    Changed to trusted release for PyPI publishing.
    pikepdf memory mapping is enabled again for improved performance, now that an issue with pikepdf has been fixed.
    ocrmypdf.helpers.calculate_downsample previously had two variants, one that took a PIL.Image and one that took a tuple[int, int]. The latter was removed.
    The snap version of ocrmypdf is now based on Ubuntu core22.
    We now account for situations where a small portion of an image on a page reports a high DPI (resolution). Previously, the entire page would be rasterized at the highest resolution, which caused performance problems. Now, the page is rasterized at a resolution based on the average DPI of the page, weighted by the area that each feature occupies. Typically, small areas of high resolution in PDFs are errors or quirks from the repeated use of assets and high resolution is not beneficial. :issue:`1010,1104,1004,1079,1010`
    Ghostscript color conversion strategy is now configurable. :issue:`1143`
  v14.4.0
    Digitally signed PDFs are now detected. If the PDF is signed, OCRmyPDF will refuse to modify it. Previously, only encrypted PDFs were detected, not those that were signed but not encrypted. :issue:`1040`
    In addition, --invalidate-digital-signatures can be used to override the above behavior and modify the PDF anyway. :issue:`1040`
    tqdm progress bars replaced with "rich" progress bars. The rich library is a new dependency. Certain APIs that used tqdm are now deprecated and will be removed in the next major release.
    Improved integration with GitHub Releases. Thanks to @stumpylog.

-------------------------------------------------------------------
Fri Jun 23 17:12:46 UTC 2023 - Frank Kunz <mailinglists@kunz-im-inter.net>

- Update to version 14.3.0
  Renamed master branch to main.
  Improve PDF rasterization accuracy by using the -dPDFSTOPONERROR option to Ghostscript.
    Use --continue-on-soft-render-error if you want to render the PDF anyway. The plugin
    specification was adjusted to support this feature; plugin authors may want to adapt
    PDF rasterizing and rendering plugins. :issue:`1083`
  The calculated deskew angle is now recorded in the logged output. :issue:`1101`
  Metadata can now be unset by setting a metadata type such as --title to an empty string. :issue:`1117,1059`
  Fixed random order of languages due to use of a set. This may have caused output to vary
    when multiple languages were set for OCR. :issue:`1113`
  Clarified the optimization ratio reported in the log output.
  Fixed :issue:`977`, where images inside Form XObjects were always excluded from image optimization.
  Added --tesseract-downsample-above to downsample larger images even when they do not exceed
    Tesseract's internal limits. This can be used to speed up OCR, possibly sacrificing accuracy.
  Fixed resampling AttributeError on older Pillow. :issue:`1096`
  Removed an error about using Ghostscript on PDFs that have the /UserUnit feature in use.
    Previously, Ghostscript would fail to process these PDFs, but in all supported versions it
    is now supported, so the error is no longer needed.
  Improved documentation around installing other language packs for Tesseract.

-------------------------------------------------------------------
Thu Apr 20 14:38:18 UTC 2023 - Frank Kunz <mailinglists@kunz-im-inter.net>

- Update to version 14.1.0
  Added --tesseract-non-ocr-timeout. This allows using Tesseract's deskew and other non-OCR features while disabling OCR using --tesseract-timeout 0.
  Added --tesseract-downsample-large-images. This downsamples large images that exceed the maximum image size Tesseract can handle. Large images may still take a long time to process, but this allows them to be processed if that is desired.
  Fixed :issue:`1082`, an issue with snap package building.
  Change linter to ruff, fix lint errors, update documentation.

-------------------------------------------------------------------
Thu Apr 13 18:43:12 UTC 2023 - Frank Kunz <mailinglists@kunz-im-inter.net>

- Update to version 14.0.4

-------------------------------------------------------------------
Tue Feb 14 18:30:21 UTC 2023 - Frank Kunz <mailinglists@kunz-im-inter.net>

- Update to version 14.0.2

-------------------------------------------------------------------
Tue May 24 09:28:03 UTC 2022 - Frank Kunz <mailinglists@kunz-im-inter.net>

- Update to version 13.4.4

-------------------------------------------------------------------
Wed Jul 29 04:38:12 UTC 2020 - Karl Cheng <qantas94heavy@gmail.com>

- Update to version 10.3.1

-------------------------------------------------------------------
Tue Dec  5 13:59:32 UTC 2017 - t.gruner@katodev.de

- update to version 4.5.6

-------------------------------------------------------------------
Wed Mar 30 12:07:20 UTC 2016 - t.gruner@katodev.de

- update to version 4.0.7

-------------------------------------------------------------------
Wed Jan 20 14:20:13 UTC 2016 - t.gruner@katodev.de

- update to version 3.1.1

-------------------------------------------------------------------
Wed Feb  4 13:34:07 UTC 2015 - t.gruner@katodev.de

- update to version 2.2

-------------------------------------------------------------------
Sun Oct 12 21:08:30 UTC 2014 - t.gruner@katodev.de

- Add a few patches from Christoph

-------------------------------------------------------------------
Mon Sep 22 20:52:14 UTC 2014 - t.gruner@katodev.de

- update to version 2.1

-------------------------------------------------------------------
Sun Dec 29 19:30:00 UTC 2013 - soeren@ploennigs.net

- initial build
- added patch to use system dirs (/tmp, /usr/share/OCRmyPDF)
- moved jhove to separate package