#
# spec file for package perl-Algorithm-VSM
#
# Copyright (c) 2016 SUSE LINUX GmbH, Nuernberg, Germany.
#
# All modifications and additions to the file contributed by third parties
# remain the property of their copyright owners, unless otherwise agreed
# upon. The license for this file, and modifications and additions to the
# file, is the same license as for the pristine package itself (unless the
# license for the pristine package is not an Open Source License, in which
# case the license is the MIT License). An "Open Source License" is a
# license that conforms to the Open Source Definition (Version 1.9)
# published by the Open Source Initiative.

# Please submit bugfixes or comments via http://bugs.opensuse.org/
#


Name:           perl-Algorithm-VSM
Version:        1.70
Release:        0
%define cpan_name Algorithm-VSM
Summary:        Perl module for retrieving files and documents from a software library
License:        Artistic-1.0 OR GPL-1.0-or-later
Group:          Development/Libraries/Perl
Url:            http://search.cpan.org/dist/Algorithm-VSM/
Source0:        http://www.cpan.org/authors/id/A/AV/AVIKAK/%{cpan_name}-%{version}.tar.gz
BuildArch:      noarch
BuildRoot:      %{_tmppath}/%{name}-%{version}-build
BuildRequires:  perl
BuildRequires:  perl-macros
BuildRequires:  perl(PDL)
Requires:       perl(PDL)
%{perl_requires}

%description
Algorithm::VSM is a Perl 5 module for constructing a Vector Space Model
(VSM) or a Latent Semantic Analysis Model (LSA) of a collection of
documents, usually referred to as a corpus, and then retrieving the
documents in response to search words in a query.

VSM and LSA models have been around for a long time in the Information
Retrieval (IR) community. More recently such models have been shown to be
effective in retrieving files/documents from software libraries. For an
account of this research that was presented by Shivani Rao and the author
of this module at the 2011 Mining Software Repositories conference, see
http://portal.acm.org/citation.cfm?id=1985451.

VSM modeling consists of:

(1) Extracting the vocabulary used in a corpus.

(2) Stemming the words so extracted and eliminating the designated stop
    words from the vocabulary. Stemming means that closely related words
    like 'programming' and 'programs' are reduced to the common root word
    'program'. Stop words are the non-discriminating words that can be
    expected to exist in virtually all the documents.

(3) Constructing document vectors for the individual files in the corpus.
    Taken together, the document vectors constitute what is usually
    referred to as a 'term-frequency' matrix for the corpus.

(4) Normalizing the document vectors to factor out the effect of document
    size and, if desired, multiplying the term frequencies by the IDF
    (Inverse Document Frequency) values for the words to reduce the weight
    of the words that appear in a large number of documents.

(5) Constructing a query vector for the search query after the query is
    subjected to the same stemming and stop-word elimination rules that
    were applied to the corpus.

(6) Using a similarity metric to return the set of documents that are most
    similar to the query vector. The commonly used similarity metric is
    one based on the cosine distance between two vectors.

Note that all the vectors mentioned here are of the same size, the size of
the vocabulary. An element of a vector is the frequency of occurrence of
the word corresponding to that position in the vector.

LSA modeling is a small variation on VSM modeling. Now you take VSM
modeling one step further by subjecting the term-frequency matrix for the
corpus to singular value decomposition (SVD). By retaining only a subset of
the singular values (usually the N largest for some value of N), you can
construct reduced-dimensionality vectors for the documents and the queries.
In VSM, as mentioned above, the size of the document and the query vectors
is equal to the size of the vocabulary. For large corpora, this size may
involve tens of thousands of words --- this can slow down the VSM modeling
and retrieval process. So you are very likely to get faster performance
with retrieval based on LSA modeling, especially if you store the model
once constructed in a database file on the disk and carry out retrievals
using the disk-based model.

%prep
%setup -q -n %{cpan_name}-%{version}

%build
%{__perl} Makefile.PL INSTALLDIRS=vendor
%{__make} %{?_smp_mflags}

%check
%{__make} test

%install
%perl_make_install
%perl_process_packlist
%perl_gen_filelist

%files -f %{name}.files
%defattr(-,root,root,755)
%doc examples README

%changelog