File bioawk.spec of Package bioawk

# spec file for package bioawk
# Copyright (c) 2012 Scott R. Santos (, Auburn University
# This file and all modifications and additions to the pristine
# package are under the same license as the package itself.
# Please submit bugfixes or comments via 
# <>

Name:             bioawk
Version:          030715
Release: 	  0
License:          GPLv2
Group:            Other
Summary:          Bioawk is an extension to Brian Kernighan's awk for dealing with several common biological data formats
Source:           %{name}-%{version}.zip
BuildRoot:        %{_tmppath}/%{name}-%{version}-build
BuildRequires:    gcc bison byacc

Bioawk is an extension to Brian Kernighan's awk acquired from [1], adding the
support of several common biological data formats, including optionally gzip'ed
BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats with the column names. It
also adds a few built-in functions including, as of now, and(), or(), xor(),
reverse() and revcomp(). The following are a few examples demonstrating the new

1. Extract unmapped reads without header:

   awk -c sam 'and($flag,4)' aln.sam.gz

2. Extract mapped reads with header:

   awk -c sam -H '!and($flag,4)'

3. Reverse complement FASTA:

   awk -c fastx '{print ">"$name;print revcomp($seq)}' seq.fa.gz

4. Create FASTA from SAM (uses revcomp if FLAG & 16)

   samtools view aln.bam | \
      awk -c sam '{s=$seq; if(and($flag, 16)) {s=revcomp($seq)} print ">"$qname"\n"s}'

5. Get the %GC from FASTA:

   awk -c fastx '{print ">"$name;print gc($seq)}' seq.fa.gz

6. Get the mean Phred quality score from FASTQ:

   awk -c fastx '{print ">"$name;print meanqual($qual)}' seq.fq.gz

7. Take column name from the first line (where "age" appears in the first line
   of input.txt):

   awk -c header '{print $age}' input.txt

Note that when "-c" is not specified and the new built-in functions are not
used, bioawk should behave exactly the same as the original BWK awk. At least
this is the intention. Bioawk also tries to minimize the modification to the
original code base such that improvements in the future versions of BWK awk
can be readily incorporated into bioawk (yes, Brian Kernighan is still
maintaining his code).

Bioawk may have the following limitations:

1. To parse FASTA and FASTQ formats, bioawk replaces the line reading module of
   awk, which also allows bioawk to seamlessly parse gzip'ed files. However,
   the new line reading code does not fully mimic the original code. It may
   fail in corner cases. Thus when "-c" is not specified, awk falls back to the
   original line reading code and does not support gzip'ed input.

2. When "-c" is in use, several strings allocated in the new line reading
   module are not freed in the end. These will be reported by valgrind as
   "still reachable". To some extend, these are not memory leaks.


%setup -q -n %{name}-%{version}


test "%{buildroot}" != "/" && %__rm -rf %{buildroot}

%__mkdir_p %{buildroot}/%{_bindir}
%__mkdir_p %{buildroot}/%{_mandir}/man1

%{__install} -m 755 bioawk %{buildroot}%{_bindir}/bioawk
%{__install} -m 0644 awk.1 %{buildroot}/%{_mandir}/man1/bioawk.1


* Sat Mar 07 2015 Scott Santos <> - 030715
- updated to git version - 030715

* Thu Nov 22 2012 Scott Santos <> - 1.0.0
- initial release - 1.0.0