File perl-String-REPartition.spec of Package perl-String-REPartition

Overview Repositories Revisions Requests Users Attributes Meta

File perl-String-REPartition.spec of Package perl-String-REPartition

#
# spec file for package perl-String-REPartition
#
# Copyright (c) 2019 SUSE LINUX GmbH, Nuernberg, Germany.
#
# All modifications and additions to the file contributed by third parties
# remain the property of their copyright owners, unless otherwise agreed
# upon. The license for this file, and modifications and additions to the
# file, is the same license as for the pristine package itself (unless the
# license for the pristine package is not an Open Source License, in which
# case the license is the MIT License). An "Open Source License" is a
# license that conforms to the Open Source Definition (Version 1.9)
# published by the Open Source Initiative.

# Please submit bugfixes or comments via https://bugs.opensuse.org/
#

Name:           perl-String-REPartition
Version:        1.6
Release:        0
%define cpan_name String-REPartition
Summary:        Generates a regex to partition a data set
License:        CHECK(Artistic-1.0 OR GPL-1.0-or-later)
Group:          Development/Libraries/Perl
Url:            https://metacpan.org/release/%{cpan_name}
Source0:        https://cpan.metacpan.org/authors/id/A/AV/AVIF/%{cpan_name}-%{version}.tar.gz
BuildArch:      noarch
BuildRoot:      %{_tmppath}/%{name}-%{version}-build
BuildRequires:  perl
BuildRequires:  perl-macros
BuildRequires:  perl(Test::Pod)
BuildRequires:  perl(Test::Pod::Coverage)
%{perl_requires}

%description
This module exports a single function -- make_partition_re. It takes as its
first argument a number between 0 and 1, representing a percentage, and as
its second argument a reference to a list of strings. It returns a regular
expression which is guaranteed to match the percentage of the strings in
the list represented by the number in the first argument. More importantly,
the regex returned will *not* to match the rest of the string in the list.
That is, if the inputs were '0.6' and a reference to a list of 100 strings,
the returned regex would match 60 of the strings in the list and not match
the other 40.

Keep in mind that, since only integer operations may be performed on these
strings, (that is, there cannot be a regex which matches a fraction of a
string), the target number is rounded down. If you have 4 strings in your
list and a ratio of 0.4, the resulting regex will match 1 string, not 1.6
strings. More interestingly, with 4 strings and a ratio of 0.1, the
resulting regex will almost certainly be '/^()$/' -- matching exactly 0 of
the strings in the list. Furthermore, because of this rounding, the
returned regex may not match precisely the number expected by multiplying
the size of the list by the ratio, but instead be off by a small number in
either direction.

c<make_partition_re()> will return c<undef> on a failure, and print a
warning to STDERR if '$^W' is true. Currently, the only errors that can
occur relate directly to the validity of the inputs. Furthermore, if the
strings in the list are not unique, the behaviour of this function is not
defined. For a small amout of repetition the regex should still work, but
it should be clear that a solution cannot be found if the input list
consists only of many copies of the same string.

The function finds its solution in roughly O(N) time -- however, in worst
cases, I think it can get as high as O(N^2). It's also true that certain
types of pathologically constructed data sets can break things and cause it
either to return an invalid regex or to enter an infinite loop. While I
haven't run into any of this in my testing, I'm not confident that I've
tested every possibility.

So why would you want to use this? Well, that's a question you'll have to
answer for yourself, mostly. :) However, the situations I envisaged while
developing the module were sort of like this: Imagine you have a large set
of data, indexed by a correspondingly large set of keywords. Let's say you
want to split this data into two partitions, perhaps in order to store the
data in two separate locations. Maintaining a complete list of the remote
keys could be expensive -- instead, you can simply store a regular
expression which matches the keys you keep remotely and does not match the
local ones.

Another interesting feature is that a regex generated from a sufficiently
large subset of your data will, approximately, match the appropriate
percentage of strings from the complete data set. This means that you do
not need to have all of the data before you generate a regex to partition
it. As an example, generating a regex from roughly 10% of the words in
/usr/dict/words (selected randomly) gave me a regex that matched within .3%
of the desired result for all of the words.

%prep
%setup -q -n %{cpan_name}-%{version}
find . -type f ! -name \*.pl -print0 | xargs -0 chmod 644

%build
PERL_USE_UNSAFE_INC=1 perl Makefile.PL INSTALLDIRS=vendor
make %{?_smp_mflags}

%check
make test

%install
%perl_make_install
%perl_process_packlist
%perl_gen_filelist

%files -f %{name}.files
%defattr(-,root,root,755)
%doc Changes README

%changelog

Places

File perl-String-REPartition.spec of Package perl-String-REPartition

Places