Transpose in awk

May 27th, 2009

Here is a script that I found here originally. It transposes a file from rows to columns using awk:

#!/usr/bin/awk -f
# Transpose the data in a file (exchange rows and columns).

# First line: start each output line without a leading delimiter.
NR == 1 {
        for (i = 1; i <= NF; i++) {
                line[i] = $i
        }
        ncols = NF
        next
}

# Remaining lines: append each field to the matching output line.
{
        for (i = 1; i <= NF; i++) {
                line[i] = line[i] FS $i
        }
        if (NF > ncols) ncols = NF
}

# Print the created lines in column order.
# ("for (ind in line)" visits the indices in an unspecified order, and
# length() on an array is not supported by every awk, so count columns.)
END {
        for (i = 1; i <= ncols; i++) {
                print line[i]
        }
}

Use it like this for a comma delimited CSV file:

transpose.awk -F ',' myfile.csv > newfile.csv
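
For comparison, here is a rough sketch of the same transpose in Python. It is not part of the original script; the function names are made up for illustration, and it assumes every row has the same number of fields (ragged rows would be cut to the shortest one).

```python
import csv
import io


def transpose_rows(rows):
    """Swap rows and columns; assumes every row has the same number of fields."""
    return [list(column) for column in zip(*rows)]


def transpose_csv_text(text, delimiter=","):
    """Transpose delimited text and return the result as a string."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerows(transpose_rows(rows))
    return out.getvalue()


print(transpose_csv_text("a,b,c\n1,2,3\n"), end="")
```

Unlike the awk version, this goes through the csv module, so quoted fields that contain the delimiter survive the round trip.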

sed and awk

April 27th, 2009

Ok, carlosm pointed out that I ignored sed and awk when I talked about Unix tools for data manipulation. I actually did this intentionally because they are too big to just lump into the other set. They are even too big for this little blog post, but I’ll make it brief. 

Sed is a non-interactive text editor. Awk is a pattern processing language with C-like syntax. Together, they let you do some amazing processing of text files. 

Sed basically allows you to apply regular-expression edits to a data stream (e.g. a file or stdin). It can take a single one-line command or a complex script. I mostly use it for command-line stuff and leave more complex scripts to Perl. As an example,


sed -n '/foo/!p' file.txt

will print all lines NOT containing foo in file.txt. Similarly,

sed 's/foo/bar/g' file.txt

will replace all occurrences of foo with bar in the lines of file.txt and write the result to standard output (file.txt itself is not changed).

Awk is more like a generalized version of the cut, paste, etc. commands that I mentioned in a previous post. As an example,


awk '{print $2;}' file.txt

will print the second column of file.txt. In general, awk can match lines AND perform an action. It takes the form:

awk '<search pattern> {<program actions>}' file.txt

where the search pattern can be a simple regular expression, a boolean combination of regular expressions, etc.

The action can be as simple as a print statement, like we saw, or an arbitrarily complex program. For example, a simple program could sum the values in one column of a file.
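
To make the pattern/action idea concrete: an awk one-liner such as awk '/foo/ {s += $2} END {print s}' file.txt sums the second column of every line containing foo. Here is a rough Python sketch of the same logic. The function name and the sample data are made up for illustration, and it assumes whitespace-separated fields.

```python
import re


def sum_column(lines, pattern, column):
    """Sum a 1-based whitespace-separated column over lines matching pattern."""
    total = 0.0
    for line in lines:
        if re.search(pattern, line):            # the "search pattern" part
            fields = line.split()               # awk-style field splitting
            total += float(fields[column - 1])  # the "action" part
    return total


sample = ["foo 1.5", "bar 10", "foo 2.5"]
print(sum_column(sample, r"foo", 2))
```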

In lieu of trying to summarize everything, I will just point to a few URLs:
http://stud.wsi.edu.pl/~robert/awk/
http://www.vectorsite.net/tsawk.html

Fitting Functions in Gnuplot

April 25th, 2009

It’s been a long time since I updated this. One feature that I frequently use in Gnuplot is the “fit” command. It allows you to do least squares regression of any function. It works really well for linear, quadratic, etc., but I have found that it isn’t so great at complex non-linear functions. I’m not sure what underlying algorithm it uses.

To perform a fit, you need your data (e.g. file.dat) in a file or else you have to replicate it in-line twice.  Then create a function like this:

g(x) = m*x+b

And assign initial guesses (not 0 or else it will result in a singular matrix):

m=1
b=1

Then fit the function:

fit g(x) 'file.dat' via m,b
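
As a cross-check for the straight-line case, the least-squares values of m and b can also be computed directly. The sketch below is not how gnuplot's fit works internally; it is the textbook closed-form solution for g(x) = m*x+b, and the function name and sample points are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (m, b) minimizing sum((m*x + b - y)**2) over the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx              # slope
    b = mean_y - m * mean_x    # intercept
    return m, b


# Points that lie exactly on y = 2*x + 1.
m, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(m, b)
```

If gnuplot's result for m and b differs a lot from this on the same data, the initial guesses are the first thing I would look at.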

The one annoying thing is that there is no easy way to add the function as a title of the plot. You can plot your data and the fit function like this:

plot 'file.dat' title 'Data', g(x) title '0.004*x+1.2'

If anyone knows how to make the title automatically from the function, please let me know!

Unix Data Manipulation

August 13th, 2008

Standard Unix and Linux distributions (e.g. CentOS, RHEL, etc.) come with many commands that are useful for extracting and manipulating graph data. A brief list of some of these commands:

  • grep - to extract data
  • sort - to order data
  • cut - to extract columns from data
  • paste - to reattach columns of data
  • uniq - to remove duplicated lines
  • join - to perform a relational join of lines
  • cmp - to compare files
  • diff - to show the differences between files
  • head - to extract the first lines of a file
  • tail - to extract the last lines of a file
  • tr - to translate character sets
  • sed - to apply regular-expression edits to a stream

These can be used in combination to extract data from a file, parse out the necessary fields, sort the fields, and add them to a data file. Sure beats cut & paste by hand!

Portable Anymap (PNM) Formats

July 22nd, 2008

I have used ImageMagick quite often for a variety of projects and I often noticed the many “pnm” utilities installed alongside it (they actually belong to the Netpbm package). For example, pnmrotate, pnmscale, pnmcat, etc. I never really paid attention to what these formats are, however. PNM provides a standard format for uncompressed bitmaps (PBM), grayscale (PGM) and color images (PPM) that is very easy to parse and write. These are perfect for manipulating with Perl, C, or any other language where you want to do a low-level hack.

There are 6 types of these images and they are all identified by the first two bytes of the file. These bytes provide a “magic number” and are decoded as:

  • P1: ASCII bitmap
  • P2: ASCII grayscale
  • P3: ASCII color
  • P4: Binary bitmap
  • P5: Binary grayscale
  • P6: Binary color

After the magic number, there can be an arbitrary amount of whitespace, typically just a newline. The next line is an optional comment line if it begins with a “#”; the comment ends at the newline. After that, again following any whitespace, the size of the image is given in pixels as ASCII decimal numbers, width first. For example, “640 480” would be a 640 wide by 480 tall image. The header information can be summarized like this example:

 P3
 # feep.ppm
 4 4
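
Here is a rough Python sketch of reading that header. It is only an illustration: the function name is made up, it treats the header as whitespace-separated tokens with “#” comments running to the end of the line, and it stops after the size (the maximum value for grayscale and color files is described below).

```python
def read_pnm_header(data):
    """Parse magic number, width and height from the start of a PNM file.

    data is a bytes object. Returns (magic, width, height, offset), where
    offset is the index just past the height token.
    """
    tokens = []
    pos = 0
    while len(tokens) < 3:
        if pos >= len(data):
            raise ValueError("truncated PNM header")
        ch = data[pos:pos + 1]
        if ch.isspace():
            pos += 1                      # skip whitespace between tokens
        elif ch == b"#":
            while pos < len(data) and data[pos:pos + 1] != b"\n":
                pos += 1                  # skip the comment up to end of line
        else:
            start = pos
            while pos < len(data) and not data[pos:pos + 1].isspace():
                pos += 1
            tokens.append(data[start:pos].decode("ascii"))
    magic, width, height = tokens[0], int(tokens[1]), int(tokens[2])
    if magic not in ("P1", "P2", "P3", "P4", "P5", "P6"):
        raise ValueError("not a PNM file: " + magic)
    return magic, width, height, pos


print(read_pnm_header(b"P3\n# feep.ppm\n4 4\n15\n"))
```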

What follows next depends on the image format. Each one is presented briefly.

For P1/P4 bitmap formats, the rasterized bitmap data follows the header. For P1 types, this data is just ASCII “0” and “1” characters with optional whitespace. For P4 types, the data is packed binary: eight pixels per byte, with the leftmost pixel in the most significant bit. Each row is padded with don’t-care bits to fill out its last byte. It is important, however, that in the ASCII formats NO LINE CAN EXTEND MORE THAN 70 CHARACTERS. Some tools ignore this, but others don’t. ASCII example:

P1
# feep.pbm
24 7
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 1 1 1 0 0 1 1 1 1 0 0 1 1 1 1 0 0 1 1 1 1 0
0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0
0 1 1 1 0 0 0 1 1 1 0 0 0 1 1 1 0 0 0 1 1 1 1 0
0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0
0 1 0 0 0 0 0 1 1 1 1 0 0 1 1 1 1 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
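
The binary (P4) packing is easy to get wrong, so here is a rough Python sketch of it. The function names are made up for illustration. It packs each row left to right, most significant bit first, and uses zeros for the padding bits in the last byte of a row.

```python
def pack_pbm_row(bits):
    """Pack a row of 0/1 pixels into bytes, most significant bit first.

    A short final chunk is padded on the right with zero bits.
    """
    packed = bytearray()
    for start in range(0, len(bits), 8):
        chunk = bits[start:start + 8]
        byte = 0
        for bit in chunk:
            byte = (byte << 1) | (bit & 1)
        byte <<= 8 - len(chunk)           # left-align a short final chunk
        packed.append(byte)
    return bytes(packed)


def encode_p4(rows):
    """Return the bytes of a binary (P4) bitmap; rows is a list of 0/1 lists."""
    height = len(rows)
    width = len(rows[0])
    header = "P4\n{} {}\n".format(width, height).encode("ascii")
    return header + b"".join(pack_pbm_row(row) for row in rows)


print(pack_pbm_row([0, 1, 1, 1, 1, 0, 0, 1, 1, 1]).hex())
```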

For P2/P5 grayscale formats, we first get a single integer that gives the maximum gray value. It can be from 1 to 65535. The maximum value (e.g. 255) is white while 0 is black. Following this, the data is again encoded as either ASCII numbers or, in the binary format, as 1 byte per pixel when the maximum value is below 256 and 2 bytes (most significant byte first) otherwise. ASCII example:

P2
# feep.pgm
24 7
15
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
0  3  3  3  3  0  0  7  7  7  7  0  0 11 11 11 11  0  0 15 15 15 15  0
0  3  0  0  0  0  0  7  0  0  0  0  0 11  0  0  0  0  0 15  0  0 15  0
0  3  3  3  0  0  0  7  7  7  0  0  0 11 11 11  0  0  0 15 15 15 15  0
0  3  0  0  0  0  0  7  0  0  0  0  0 11  0  0  0  0  0 15  0  0  0  0
0  3  0  0  0  0  0  7  7  7  7  0  0 11 11 11 11  0  0 15  0  0  0  0
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

For P3/P6 color images, the format is the same as the grayscale images except that there are now 3 numbers per pixel, instead of 1, to represent red, green and blue (in that order). ASCII example:

P3
# feep.ppm
4 4
15
 0  0  0    0  0  0    0  0  0   15  0 15
 0  0  0    0 15  7    0  0  0    0  0  0
 0  0  0    0  0  0    0 15  7    0  0  0
15  0 15    0  0  0    0  0  0    0  0  0
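
The binary color format (P6) uses the same header followed by raw samples. Here is a rough Python sketch of a writer; the function name is made up for illustration, and it assumes each pixel is a (red, green, blue) tuple of integers no larger than the maximum value.

```python
def encode_p6(pixels, maxval=255):
    """Return the bytes of a binary (P6) color image.

    pixels is a list of rows; each row is a list of (red, green, blue) tuples.
    """
    if not 0 < maxval < 65536:
        raise ValueError("maxval must be between 1 and 65535")
    sample_size = 1 if maxval < 256 else 2    # bytes per sample
    height = len(pixels)
    width = len(pixels[0])
    body = bytearray()
    for row in pixels:
        for rgb in row:
            for sample in rgb:
                # Two-byte samples are stored most significant byte first.
                body += sample.to_bytes(sample_size, "big")
    header = "P6\n{} {}\n{}\n".format(width, height, maxval).encode("ascii")
    return header + bytes(body)


data = encode_p6([[(15, 0, 15), (0, 0, 0)]], maxval=15)
print(len(data))
```

Writing the returned bytes to a file opened in binary mode should give something the pnm tools can read.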

Animations

June 3rd, 2008

A really fun feature of gnuplot in the more recent versions is animated GIFs. This requires version 2.0.28 of the Boutell gd library and 4.3(?) of gnuplot. You do this with the terminal type:

set terminal gif animate delay 10

and then plot several items. The delay is in 1/100th of a second, so each frame in this case is 0.1s (10 frames per second). Combined with yesterday’s post, we can make animated droplets:

Animated GIF Example
The entire source code for the example is here.
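
If you want to generate the frames from a program, here is a rough Python sketch that writes out a gnuplot script with one inline plot per frame. This is not the source of the example above. The function name, the output file name, and the sample frames are all made up for illustration, and it reuses the variable point size plot command from the previous post.

```python
def animation_script(frames, output="droplets.gif", delay=10):
    """Return gnuplot commands that draw one inline plot per frame.

    frames is a list of frames; each frame is a list of (x, y, size) tuples.
    """
    lines = [
        "set terminal gif animate delay {}".format(delay),
        "set output '{}'".format(output),
    ]
    for frame in frames:
        lines.append("plot '-' using 1:2:3 with points lt 1 pt 6 ps variable notitle")
        for x, y, size in frame:
            lines.append("{} {} {}".format(x, y, size))
        lines.append("e")                 # end of this frame's inline data
    return "\n".join(lines) + "\n"


# Three frames of a single droplet that grows over time.
frames = [[(5, 5, size)] for size in (2, 4, 8)]
print(animation_script(frames), end="")
```

You would probably also want to fix the x and y ranges in the script so the axes do not rescale from one frame to the next.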

Variable Size Points

June 2nd, 2008

Example of variable-size data points.

I remember writing a Perl hack to do this some years ago, but it appears that you can now specify the size of the points in a plot as a 3rd variable. For example,

plot '-' using 1:2:3 with points lt 1 pt 6 ps variable
1 3 8
6 2 2
5 5 4
e

Point type 6 is circles and the 3rd column specifies the size of the circles.

Geometric Layout Plots

June 1st, 2008

One of the most frequent things that I use gnuplot for is to view geometric layouts. This can be done by simply generating data for the shapes in clockwise or counter-clockwise order. For example,

set xrange [-1:21]
set yrange [-1:21]
plot '-' with lp
0 0
0 10
10 10
10 0
0 0

10 10
10 20
20 20
20 10
10 10
e

will plot two squares of size “10” on a side. Multiple shapes are added to the same data source by leaving a blank line between the data sets.
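
Typing corner points by hand gets old quickly, so here is a rough Python sketch that generates this kind of inline data for rectangles. The function names are made up for illustration. Each outline is closed by repeating the first corner, and shapes are separated by a blank line.

```python
def rectangle_outline(x0, y0, x1, y1):
    """Return the corners of a rectangle in clockwise order, closed."""
    return [(x0, y0), (x0, y1), (x1, y1), (x1, y0), (x0, y0)]


def inline_data(shapes):
    """Format shapes as gnuplot inline data, one blank line between shapes."""
    blocks = ["\n".join("{} {}".format(x, y) for x, y in shape) for shape in shapes]
    return "\n\n".join(blocks) + "\ne\n"


squares = [rectangle_outline(0, 0, 10, 10), rectangle_outline(10, 10, 20, 20)]
print("plot '-' with lp")
print(inline_data(squares), end="")
```

For the two squares above this prints the same point list as the hand-written example.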

I have used this to plot VLSI circuit layouts in a very portable manner (since gnuplot is on almost every platform). Here is a small circuit:

If the data is not “in line” (i.e. you put it in a file instead of the '-') then you can also interactively zoom when using the X11 terminal. This is a newer feature of gnuplot 3.8+. You simply use the right mouse button to select a region and then click the left button to execute the zoom. Pressing “p” will return to the previous scale.

Welcome

June 1st, 2008

I’ve always been fascinated by cool graphs. To quote one of my favorite movies:

(1) Mathematics is the language of nature; (2) Everything around us can be represented and understood through numbers; (3) If you graph the numbers of any system, patterns emerge.

- Maximillian Cohen, Pi by Darren Aronofsky (1998)

This is a site dedicated to creating, sharing, and enabling the creation of interesting graphs. I use many free tools to do this such as gnuplot, graphviz, xgraph, xfig, Inkscape, and hacks with Perl and Unix to manipulate data.

Part of the value of this site is also as a tutorial for people who are new to data visualization on Unix/Linux/OSX platforms. Periodically, I will post “tips” about tools and key features of these tools.