File perl-Algorithm-TicketClusterer.spec of Package perl-Algorithm-TicketClusterer

#
# spec file for package perl-Algorithm-TicketClusterer
#
# Copyright (c) 2016 SUSE LINUX GmbH, Nuernberg, Germany.
#
# All modifications and additions to the file contributed by third parties
# remain the property of their copyright owners, unless otherwise agreed
# upon. The license for this file, and modifications and additions to the
# file, is the same license as for the pristine package itself (unless the
# license for the pristine package is not an Open Source License, in which
# case the license is the MIT License). An "Open Source License" is a
# license that conforms to the Open Source Definition (Version 1.9)
# published by the Open Source Initiative.

# Please submit bugfixes or comments via http://bugs.opensuse.org/
#

Name:           perl-Algorithm-TicketClusterer
Version:        1.01
Release:        0
%define cpan_name Algorithm-TicketClusterer
Summary:        Perl module for retrieving Excel-stored past tickets similar to a new ticket
License:        GPL-1.0+ or Artistic-1.0
Group:          Development/Libraries/Perl
Url:            http://search.cpan.org/dist/Algorithm-TicketClusterer/
Source0:        http://www.cpan.org/authors/id/A/AV/AVIKAK/%{cpan_name}-%{version}.tar.gz
BuildArch:      noarch
BuildRoot:      %{_tmppath}/%{name}-%{version}-build
BuildRequires:  perl
BuildRequires:  perl-macros
BuildRequires:  perl(Spreadsheet::ParseExcel) >= 0.59
BuildRequires:  perl(Spreadsheet::XLSX) >= 0.13
BuildRequires:  perl(Text::Iconv) >= 1.7
BuildRequires:  perl(WordNet::QueryData) >= 1.47
Requires:       perl(Spreadsheet::ParseExcel) >= 0.59
Requires:       perl(Spreadsheet::XLSX) >= 0.13
Requires:       perl(Text::Iconv) >= 1.7
Requires:       perl(WordNet::QueryData) >= 1.47
%{perl_requires}

%description
*Algorithm::TicketClusterer* is a _perl5_ module for retrieving previously
processed Excel-stored tickets similar to a new ticket. Routing decisions
made for the past similar tickets can be useful in expediting the routing
of a new ticket.

Tickets are commonly used in the software services industry and customer
support businesses to record requests for service, product complaints, user
feedback, and so on.

With regard to the routing of a ticket, you would want each new ticket to
be handled by the tech support individual who is most qualified to address
the issue raised in the ticket. Identifying the right individual for each
new ticket in real-time is no easy task for organizations that man large
service centers and helpdesks. So if it were possible to quickly identify
the previously processed tickets that are most similar to a new ticket, one
could think of constructing semi-automated (or, perhaps, even fully
automated) ticket routers.

Identifying old tickets similar to a new ticket is made challenging by the
fact that folks who submit tickets often write them quickly and informally.
The informal style of writing means that different people may use different
colloquial terms to describe the same thing. And the quickness associated
with their submission causes the tickets to frequently contain spelling and
other errors such as conjoined words, fragmentation of long words, and so
on.

This module is an attempt at dealing with these challenges.

The problem of different people using different words to describe the same
thing is taken care of by using WordNet to add to each ticket a designated
number of synonyms for each word in the ticket. The idea is that after all
the tickets are expanded in this manner, they would become grounded in a
common vocabulary. The synonym expansion of a ticket takes place only after
the negated phrases (that is, the words preceded by 'no' or 'not') are
replaced by their antonyms.

Obviously, expanding a ticket by synonyms makes sense only after it is
corrected for spelling and other errors. What sort of errors one looks for
and corrects would, in general, depend on the application domain of the
tickets. (It is not uncommon for engineering services to use jargon words
and acronyms that look like spelling errors to those not familiar with the
services.) The module expects to see a file that is supplied through the
constructor parameter 'misspelled_words_file' that contains misspelled
words in the first column and their corrected versions in the second
column. An example of such a file is included in the 'examples' directory.
You would need to create your own version of such a file for your
application domain. Since conjuring up the misspellings that your ticket
submitters are likely to throw at you is futile, you might consider using
the following approach, which I prefer to actually reading the tickets for
such errors: Turn on the debugging options in the constructor for some
initially collected spreadsheets and watch which words WordNet is not
able to supply any synonyms for. In a large majority of cases, these
would be the misspelled words.

Expanding a ticket with synonyms is made complicated by the fact that some
common words have such a large number of synonyms that they can overwhelm
the relatively small number of words in a ticket. Adding too many synonyms
in relation to the size of a ticket can not only distort the sense of the
ticket but also increase the computational cost of processing all the
tickets.

In order to deal with the pros and the cons of using synonyms, the present
module strikes a middle ground: You can specify how many synonyms to use
for a word (assuming that the number of synonyms supplied by WordNet is
larger than the number specified). This allows you to experiment with
retrieval precision by altering the number of synonyms used. The retained
synonyms are selected randomly from those supplied by WordNet. (A smarter
way to select synonyms would be to base them on the context. For example,
you would not want to use the synonym `programmer' for the noun `developer'
if your application domain is real-estate. However, such context-dependent
selection of synonyms would take us into the realm of ontologies that I
have chosen to stay away from in this first version of the module.)

Another issue related to the overall run-time performance of this module is
the computational cost of the calls to WordNet through its Perl interface
'WordNet::QueryData'. This module uses what I have referred to as _synset
caching_ to make this process as efficient as possible. The result of each
WordNet lookup is cached in a database file whose name you supply through
the constructor option 'synset_cache_db'. If you are doing a good job of
catching spelling errors, the module will carry out a decreasing number of
WordNet lookups as the tickets are scanned for expansion with synonyms. In
an experiment with a spreadsheet that contained over 1400 real tickets, the
last several hundred resulted in hardly any calls to WordNet.

As currently programmed, the synset cache is deleted and then created
afresh at every call to the function that extracts information from an
Excel spreadsheet. You would want to change this behavior of the module if
you are planning to use it in a production environment where the different
spreadsheets are likely to deal with the same application domain. To give
greater persistence to the synset cache, comment out the 'unlink
$self->{_synset_cache_db}' line in the method 'get_tickets_from_excel()'.
After a few updates of the synset cache, the module would almost never need
to make direct calls to WordNet, which would enhance the speed of the
module even further.

The textual content of the tickets, as produced by the preprocessing steps,
is used for document modeling, and the doc model thus created is used
subsequently for retrieving similar tickets. The doc modeling is carried
out using the Vector Space Model (VSM) in which each ticket is represented
by a vector whose size equals the size of the vocabulary used in all the
tickets and whose elements represent the word frequencies in the ticket.
After such a model is constructed, a query ticket is compared with the
other tickets on the basis of the cosine similarity between the
corresponding vectors.

My decision to use the simplest of the text models --- the Vector Space
Model --- was based on the work carried out by Shivani Rao at Purdue who
has demonstrated that the simpler models are more effective at retrieval
from software libraries than the more complex models. (See the paper by
Shivani Rao and Avinash Kak at the MSR'11 Conference.) Although tickets, in
general, are not the same as software libraries, I have a strong feeling
that Shivani's conclusions would extend to other domains as well. Having
said that, it is important to mention that there remains the possibility
that automated ticket routing for some applications may respond better to
more elaborate text models.

The module uses three mechanisms to speed up the retrieval of tickets
similar to a query ticket: (1) It uses the inverted index for all the words
to construct for each query ticket a candidate pool of only those tickets
in the database that have words in common with the query ticket; (2) Only
those query-ticket words are used for retrieval whose
inverse-document-frequency values exceed a user-specified threshold; and
(3) The module uses stemming to reduce the variants of the same word to a
common root in order to limit the size of the vocabulary. The stemming used
in the current module is rudimentary. However, it would be easy to plug
into the module more powerful stemmers through their Perl interfaces.
Future versions of this module may do exactly that.

%prep
%setup -q -n %{cpan_name}-%{version}
find . -type f ! -name \*.pl -print0 | xargs -0 chmod 644

%build
%{__perl} Makefile.PL INSTALLDIRS=vendor
%{__make} %{?_smp_mflags}

%check
%{__make} test

%install
%perl_make_install
%perl_process_packlist
%perl_gen_filelist

%files -f %{name}.files
%defattr(-,root,root,755)
%doc examples README

%changelog
