Stanislav Brabec
sbrabec
Involved Projects and Packages
HTML::TableExtract is a subclass of HTML::Parser that serves to extract the
information from tables of interest contained within an HTML document. The
information from each extracted table is stored in table objects. Tables
can be extracted as text, HTML, or HTML::ElementTable structures (for
in-place editing or manipulation).
There are currently four constraints available to specify which tables you
would like to extract from a document: _Headers_, _Depth_, _Count_, and
_Attributes_.
_Headers_, the most flexible and adaptive of the techniques, involves
specifying text in an array that you expect to appear above the data in the
tables of interest. Once all headers have been located in a row of that
table, all further cells beneath the columns that matched your headers are
extracted. All other columns are ignored: think of it as vertical slices
through a table. In addition, TableExtract automatically rearranges each
row in the same order as the headers you provided. If you would like to
disable this, set _automap_ to 0 during object creation, and instead rely
on the column_map() method to find out the order in which the headers were
found. Furthermore, TableExtract will automatically compensate for cell
span issues so that columns are really the same columns as you would
visually see in a browser. This behavior can be disabled by setting the
_gridmap_ parameter to 0. HTML is stripped from the entire textual content
of a cell before header matches are attempted -- unless the _keep_html_
parameter was enabled.
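The "vertical slices" idea can be sketched outside of Perl as well. The following Python fragment is only an illustration of the concept and is not HTML::TableExtract's implementation: it assumes the table has already been reduced to a list of rows of plain-text cells, it matches headers as case-insensitive substrings, and the function name `extract_by_headers` and its `automap` argument are hypothetical names chosen to mirror the description above.

```python
def extract_by_headers(rows, headers, automap=True):
    """Return the cells beneath the columns whose header cells match.

    rows     -- list of rows, each a list of plain-text cell strings
    headers  -- header texts to look for (case-insensitive substring match,
                an assumption of this sketch)
    automap  -- if true, output columns follow the order of `headers`;
                otherwise they keep the order found in the table
    """
    for i, row in enumerate(rows):
        col_for = {}
        for header in headers:
            for j, cell in enumerate(row):
                if header.lower() in cell.lower() and j not in col_for.values():
                    col_for[header] = j
                    break
        if len(col_for) == len(headers):
            # All headers were located in this row: slice every later row.
            cols = [col_for[h] for h in headers]
            if not automap:
                cols = sorted(cols)
            return [
                [r[c] if c < len(r) else None for c in cols]
                for r in rows[i + 1:]
            ]
    return []  # no row contained all of the headers


ROWS = [
    ["Report", "", ""],
    ["Name", "Qty", "Price"],
    ["apple", "3", "0.50"],
    ["pear", "7", "0.25"],
]
print(extract_by_headers(ROWS, ["Price", "Name"]))
# -> [['0.50', 'apple'], ['0.25', 'pear']]
print(extract_by_headers(ROWS, ["Price", "Name"], automap=False))
# -> [['apple', '0.50'], ['pear', '0.25']]
```

The "Qty" column is ignored in both calls because it was not among the requested headers.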
_Depth_ and _Count_ are more specific ways to specify tables in relation to
one another. _Depth_ represents how deeply a table resides in other tables.
The depth of a top-level table in the document is 0. A table within a
top-level table has a depth of 1, and so on. Each depth can be thought of
as a layer; tables sharing the same depth are on the same layer. Within
each of these layers, _Count_ represents the order in which a table was
seen at that depth, starting with 0. Providing both a _depth_ and a _count_
will uniquely specify a table within a document.
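As a rough illustration of how depth and count coordinates arise, the sketch below walks an HTML string with Python's standard `html.parser` and assigns a (depth, count) pair to each table. It is not the module's own code; the class name `TableCoords` and the helper `table_coords` are hypothetical, and the sketch assumes well-formed, properly closed table tags.

```python
from html.parser import HTMLParser


class TableCoords(HTMLParser):
    """Record a (depth, count) pair for every <table> start tag."""

    def __init__(self):
        super().__init__()
        self.depth = -1   # -1 means "not inside any table"
        self.seen = {}    # depth -> number of tables seen so far at that depth
        self.coords = []  # (depth, count) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            count = self.seen.get(self.depth, 0)
            self.seen[self.depth] = count + 1
            self.coords.append((self.depth, count))

    def handle_endtag(self, tag):
        if tag == "table" and self.depth >= 0:
            self.depth -= 1


def table_coords(html):
    parser = TableCoords()
    parser.feed(html)
    parser.close()
    return parser.coords


# Two top-level tables, each containing one nested table.
SAMPLE = (
    "<table><tr><td>"
    "<table><tr><td>a</td></tr></table>"
    "</td></tr></table>"
    "<table><tr><td>"
    "<table><tr><td>b</td></tr></table>"
    "</td></tr></table>"
)
print(table_coords(SAMPLE))
# -> [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Note that the second nested table gets count 1 even though it lives in a different parent: counts run across the whole layer, not per parent table.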
_Attributes_ match based on the attributes of the HTML table tag, for
example, border widths or background color.
The _Headers_, _Depth_, _Count_, and _Attributes_ specifications
are cumulative in their effect on the overall extraction. For instance, if
you specify only a _Depth_, then you get all tables at that depth (note
that these could very well reside in separate higher-level tables
throughout the document since depth extends across tables). If you specify
only a _Count_, then the tables at that _Count_ from all depths are
returned (i.e., the _n_th occurrence of a table at each depth). If you only
specify _Headers_, then you get all tables in the document containing those
column headers. If you have specified multiple constraints of _Headers_,
_Depth_, _Count_, and _Attributes_, then each constraint has veto power
over whether a particular table is extracted.
If no _Headers_, _Depth_, _Count_, or _Attributes_ are specified, then all
tables match.
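The veto behaviour can be pictured as a chain of independent checks, each of which may reject a table. The Python sketch below is an illustration under stated assumptions rather than the module's logic: tables are represented as plain dictionaries with hypothetical `depth`, `count`, and `attribs` keys, and the _Headers_ constraint is left out for brevity.

```python
def table_matches(table, depth=None, count=None, attribs=None):
    """Return True unless one of the supplied constraints vetoes the table."""
    if depth is not None and table["depth"] != depth:
        return False
    if count is not None and table["count"] != count:
        return False
    if attribs is not None:
        for name, value in attribs.items():
            if table["attribs"].get(name) != value:
                return False
    return True  # no constraint given, or none objected


def select(tables, **constraints):
    return [t for t in tables if table_matches(t, **constraints)]


TABLES = [
    {"depth": 0, "count": 0, "attribs": {"border": "1"}},
    {"depth": 1, "count": 0, "attribs": {}},
    {"depth": 0, "count": 1, "attribs": {"border": "1"}},
    {"depth": 1, "count": 1, "attribs": {"border": "0"}},
]

print(len(select(TABLES)))                    # no constraints: all 4 tables
print(len(select(TABLES, depth=1)))           # both tables on layer 1
print(len(select(TABLES, depth=1, count=1)))  # exactly one table
```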
When extracting only text from tables, the text is decoded with
HTML::Entities by default; this can be disabled by setting the _decode_
parameter to 0.
Extraction Modes
The default mode of extraction for HTML::TableExtract is raw text or
HTML. In this mode, embedded tables are completely decoupled from one
another. In this case, HTML::TableExtract is a subclass of
HTML::Parser:
use HTML::TableExtract;
Alternatively, tables can be extracted as HTML::ElementTable
structures, which are in turn embedded in an HTML::Element tree
representing the entire HTML document. Embedded tables are not
decoupled from one another since this tree structure must be
maintained. In this case, HTML::TableExtract is a subclass of
HTML::TreeBuilder (itself a subclass of HTML::Parser):
use HTML::TableExtract qw(tree);
In either case, the basic interface for HTML::TableExtract and the
resulting table objects remains the same -- all that changes is what
you can do with the resulting data.
HTML::TableExtract is a subclass of HTML::Parser, and as such inherits
all of its basic methods such as 'parse()' and 'parse_file()'. During
scans, 'start()', 'end()', and 'text()' are utilized. Feel free to
override them, but if you do not eventually invoke them in the SUPER
class with some content, results are not guaranteed.
Advice
The main point of this module was to provide a flexible method of
extracting tabular information from HTML documents without relying too
heavily on the document layout. For that reason, I suggest using
_Headers_ whenever possible -- that way, you are anchoring your
extraction on what the document is trying to communicate rather than
some feature of the HTML comprising the document (other than the fact
that the data is contained in a table).
pkcs11-helper allows using multiple PKCS#11 providers at the same
time and selecting keys by id, label or certificate subject.
It also covers the following topics:
* Handling card removal and card insert events
* Handling card re-insert to a different slot
* Supporting session expiration serialization
* and much more
All this is possible using a simple API.
Potrace is a utility for tracing a bitmap, that is, transforming a
bitmap into a smooth, scalable image. The input is a bitmap (PBM, PGM,
PPM, or BMP), and the default output is one of several vector file
formats. A typical use is to create EPS files from scanned data, such
as company or university logos, handwritten notes, etc. The resulting
image is not "jaggy" like a bitmap, but smooth. It can then be rendered
at any resolution.
PowerMan is a tool for manipulating remote power control (RPC) devices from a
central location. Several RPC varieties are supported natively by PowerMan and
Expect-like configurability simplifies the addition of new devices.
pstoedit converts PostScript and PDF files to other vector graphic
formats so that they can be edited graphically.
pstoedit supports:
* Tgif .obj format (for tgif version >= 3)
* .fig format for xfig
* pdf - Adobe's Portable Document Format
* gnuplot format
* Flattened PostScript (with or without Bezier curves)
* DXF - CAD exchange format
* LWO - LightWave 3D
* RIB - RenderMan
* RPL - Real3D
* Java 1 or Java 2 applet
* Idraw format (in fact a special form of EPS that idraw can read)
* Tcl/Tk
* HPGL
* AI (Adobe Illustrator) (based on ps2ai.ps - not a real pstoedit driver - see notes below and manual)
* Windows Meta Files (WMF) (Windows only)
* Enhanced Windows Meta Files (EMF) (Windows, but also Linux/Unix if libemf is available)
* OS/2 meta files (OS/2 only)
* PIC format for troff/groff
* MetaPost format for usage with TeX/LaTeX
* LaTeX2e picture
* Kontour
* GNU Metafile (plotutils / libplot)
* Skencil ( http://www.skencil.org )
* Mathematica
* via ImageMagick to any format supported by ImageMagick
* SWF
* CNC G code
* VTK files for ParaView and similar visualization tools
wxWidgets is a free C++ library for cross-platform GUI development. This package contains Python bindings for wxWidgets.
quvi is a command-line tool for parsing video download links. It supports
YouTube and other similar video websites.
This package converts various character sets.
rzsz allows you to use "sz filename" to send a file to your local
system.
This package is based on the package 'sampleicc' from project 'home:bekun'.
SampleICC provides an open source platform independent C++ library for reading, writing, manipulating, and applying ICC profiles along with applications that make use of this library.
Sed takes text input, performs one or more operations on it, and
outputs the modified text. Sed is typically used for extracting parts
of a file using pattern matching or for substituting multiple
occurrences of a string within a file.
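For readers less familiar with sed, the two typical uses mentioned above correspond roughly to `sed 's/disk/drive/g'` (substitution) and `sed -n '/^error/p'` (extraction). The Python sketch below mimics them with the standard `re` module; the function names and the sample text are made up for illustration, and Python's regular expression syntax differs from sed's in some details.

```python
import re


def substitute_all(text, pattern, replacement):
    """Replace every match of `pattern`, like sed's s/pattern/replacement/g."""
    return re.sub(pattern, replacement, text)


def matching_lines(text, pattern):
    """Keep only the lines that match, like sed -n '/pattern/p'."""
    return [line for line in text.splitlines() if re.search(pattern, line)]


LOG = "error: disk full\ninfo: all good\nerror: disk offline\n"

print(substitute_all(LOG, r"disk", "drive"))
print(matching_lines(LOG, r"^error"))
# -> ['error: disk full', 'error: disk offline']
```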
SILC Toolkit is a software development toolkit which provides full SILC
protocol implementation for application developers. The SILC Toolkit
provides SILC Client Library, SILC Protocol Core Library, SILC Key
Exchange Library, SILC Crypto Library, SILC Math Library, SILC Utility
Library, as well as other libraries. The SILC Toolkit also includes a
full reference manual and developer guide with examples and tutorials.
SILC (Secure Internet Live Conferencing) is a protocol which provides
secure conferencing services on the Internet over insecure channels.
SILC is similar to IRC, as they both provide conferencing services and
almost have the same commands. However, they differ internally: unlike
IRC, SILC is secure and has an entirely different network model
compared to IRC.
Site configuration for autoconf based configure scripts provides smart
defaults for paths that are not specified.
SMARTmontools controls and monitors storage devices using the
Self-Monitoring, Analysis, and Reporting Technology System (S.M.A.R.T.)
built into ATA, SATA and SCSI hard drives. This is used to check the
hard drive reliability and to predict drive failures. The suite
contains two utilities. The first, smartctl, is a command line utility
designed to perform simple S.M.A.R.T. tasks. The second, smartd, is a
daemon that periodically monitors the S.M.A.R.T. status and reports errors
to syslog. The package is compatible with the ATA/ATAPI-3 to -7
specification. The package is intended to incorporate as much "vendor
specific" and "reserved" information as possible about disk drives. The
commands 'man smartctl' and 'man smartd' will provide more information.
SoundTouch is an open source audio processing library that allows
changing the sound tempo, pitch and playback rate parameters
independently from each other.
* Easy-to-use implementation of time-stretch, pitch-shift and sample
  rate transposing routines.
* High-performance object-oriented C++ implementation.
* Clear and easy-to-use programming interface via a single C++ class.
* Supported audio data formats: 16-bit integer or 32-bit floating point
  PCM, mono/stereo.
* Capable of real-time audio stream processing:
  input/output latency max. ~ 100 ms.
  Processing 44.1 kHz/16-bit stereo sound in real time requires a 133 MHz
  Intel Pentium processor or better.
* Additional assembler-level and Intel MMX instruction set optimizations
  for Intel x86 compatible processors, offering a several-fold increase
  in processing performance.
This is a tool for updating translations using available
upstream resources.
The tool is intended for use during package compilation as the first
command after unpacking the source code.
For more information, see README and HOWTO.
This package also includes translation update data files.
GNU VCDImager is a full-featured mastering suite for authoring,
disassembling and analyzing Video CDs and Super Video CDs.
The following features are available so far:
* Support for Video CD 1.1 and 2.0 disc formats
* Support for the Super Video CD 1.0 disc format
* Full PBC (playback control) support (play lists, selection lists and
  end lists)
* Support for segment play items
* Automatic padding of MPEG streams on the fly
* Support for 99-minute (out-of-specification) CD-R media
* Extraction of Video CDs into files (incl. the PBC information)
* Use of XML for the description of Video CDs
VIGRA stands for "Vision with Generic Algorithms". It is a novel
computer vision library that puts its main emphasis on customizable
algorithms and data structures. By using template techniques similar to
those in the C++ Standard Template Library, you can easily adapt any
VIGRA component to the needs of your application, without giving up
execution speed.
WavPack is a completely open audio compression format providing
lossless, high-quality lossy, and unique hybrid compression modes.
Although the technology is loosely based on previous versions of
WavPack, the new version 4 format has been designed from the ground up
to offer unparalleled performance and functionality.
In its default lossless mode, WavPack acts just like a WinZip compressor
for audio files. However, unlike MP3 or WMA encoding which can affect
the sound quality, not a single bit of the original information is
lost, so there is no chance of degradation. This makes lossless mode
ideal for archiving audio material or any other situation where quality
is paramount. The compression ratio depends on the source material, but
generally is between 30% and 70%.
The hybrid mode provides all the advantages of lossless compression
with an additional bonus. Instead of creating a single file, this mode
creates both a relatively small, high-quality lossy file that can be
used all by itself, and a "correction" file that (when combined with
the lossy file) provides full lossless restoration. For some users this
means never having to choose between lossless and lossy compression!
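The idea of a lossy file plus a "correction" file can be illustrated with a deliberately simple toy model. The Python sketch below is not WavPack's codec and says nothing about its actual format: it assumes integer PCM samples, uses coarse quantization with a hypothetical `step` parameter as the "lossy" part, and stores the per-sample difference as the "correction".

```python
def split_hybrid(samples, step=16):
    """Split integer samples into a coarse (lossy) part and a correction part."""
    lossy = [(s // step) * step for s in samples]        # rounded down to a multiple of step
    correction = [s - l for s, l in zip(samples, lossy)]  # what the rounding discarded
    return lossy, correction


def restore(lossy, correction):
    """Recombine both parts to recover the original samples exactly."""
    return [l + c for l, c in zip(lossy, correction)]


ORIGINAL = [0, 5, -5, 1000, -32768, 32767]
lossy, correction = split_hybrid(ORIGINAL)

print(lossy)       # usable on its own, but only approximate
print(correction)  # small values in the range 0..step-1
print(restore(lossy, correction) == ORIGINAL)
# -> True
```

The lossy list alone is an approximation of the signal, while the two lists together reproduce every sample bit for bit, which is the property the hybrid mode description relies on.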
WebRTC is an open source project that gives web browsers Real-Time
Communications (RTC) capabilities via simple JavaScript APIs. The WebRTC
components have been optimized to best serve this purpose.
WebRTC implements the W3C's proposal for video conferencing on the web.
wxWidgets is a free C++ library for cross-platform GUI.
With wxWidgets, you can create applications for different GUIs (GTK+,
Motif, MS Windows, MacOS X, Windows CE, GPE) from the same source code.
This package contains wxWidgets documentation in HTML format.