Seek and Destroy

Sed, the stream editor, reads text and modifies it on the fly. While its syntax appears cryptic at first blush, sed is a versatile text-processing tool worth adding to your UNIX utility belt.

In the real world, one could combine sed with the find utility to recursively locate the .php files of a website, then remove from them: 1) all HTML tags, and 2) all leading whitespace.

Here's one approach to that task:

#!/usr/bin/env bash

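# Strip HTML tags and leading whitespace from every .php file below ./
# Assumes GNU sed (-i without a backup suffix, \t inside brackets).
find ./ -type f -name '*.php' \
  -exec sed -i 's/<[^>]*>//g; s/^[ \t]*//g' {} \;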

Let's take a look at the find portion of the snippet first. For reference, the syntax for this command is:

find [starting point(s)] [expression: tests and actions]

In our example, we tell find to begin from the present working directory ./, and search only for regular files -type f whose names match the pattern '*.php'. We then instruct find to -exec (execute) sed, which runs two (2) commands on each file found.
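Before letting sed loose, it can help to see which files find would actually select. Here is a minimal sketch; the function name list_php_files is made up for illustration, and the search directory defaults to ./ when none is given:

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the regular files ending in .php below a
# directory (default: the current one), sorted for stable output.
list_php_files() {
  find "${1:-.}" -type f -name '*.php' | sort
}
```

Running list_php_files ./ prints one path per line and changes nothing on disk, which makes it a cheap sanity check before any in-place edit.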

Ouch! Those sed commands look evil. We'll break them down, but before we do, pay attention to the syntax for the command:

sed [options] 'script' [input file(s)]

and note the following:

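-i = edit the file in place (rather than printing to standard output)
s = substitute
g = global (replace every match on a line, not just the first)
// = empty replacement (the matched text is deleted)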
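To see what the g flag buys us, here is a small sketch that removes dashes from a line with and without it. The two function names are invented for this example:

```shell
#!/usr/bin/env bash

# Hypothetical helpers: the same substitution, without and with the g flag.
drop_first_dash()  { sed 's/-//'; }    # replaces only the first match per line
drop_all_dashes()  { sed 's/-//g'; }   # replaces every match per line

printf '%s\n' 'a-b-c' | drop_first_dash   # prints: ab-c
printf '%s\n' 'a-b-c' | drop_all_dashes   # prints: abc
```

In both cases the replacement between the last two slashes is empty, so each match is simply deleted.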

Our first expression:

s/<[^>]*>//g

looks for an opening angle bracket <
which is followed by zero or more * characters that are NOT ^ a closing angle bracket >
then looks for a closing angle bracket >

replacing matches with nothing.
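You can try the first expression on its own by piping a line of markup through sed. This is only a sketch, and strip_tags is a name made up for the occasion:

```shell
#!/usr/bin/env bash

# Hypothetical helper: delete anything that looks like <...> from stdin.
strip_tags() {
  sed 's/<[^>]*>//g'
}

printf '%s\n' '<p>Hello, <b>world</b>!</p>' | strip_tags   # prints: Hello, world!
```

Keep in mind that sed works one line at a time, so a tag that opens on one line and closes on the next will slip through untouched.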

Our second expression:

s/^[ \t]*//g

looks for any run of spaces and tabs \t at the start ^ of a line

replacing matches with nothing.
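The second expression can be tested the same way. The sketch below swaps in the POSIX class [[:blank:]] (space or tab) for [ \t], on the assumption that you may not be running GNU sed; strip_leading_blanks is a hypothetical name:

```shell
#!/usr/bin/env bash

# Hypothetical helper: remove leading spaces and tabs from each line of stdin.
# [[:blank:]] is used here instead of [ \t] so the sketch does not depend on
# GNU sed's handling of \t inside brackets.
strip_leading_blanks() {
  sed 's/^[[:blank:]]*//'
}

printf '  \tfoo bar\n' | strip_leading_blanks   # prints: foo bar
```

Because the pattern is anchored with ^, it can match at most once per line, so the g flag makes no difference to the result here.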

Find then brings it home by substituting the path of each file it locates for the {} placeholder and running our commands on that file. The escaped semi-colon \; marks the end of the -exec command.
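Since in-place edits are hard to undo, a more cautious variant might keep a backup of every file it touches. The sketch below is one way to do that, not the only one: clean_php_tree is a made-up name, the .bak suffix is an arbitrary choice, and [[:blank:]] stands in for [ \t] as a portability assumption:

```shell
#!/usr/bin/env bash

# Hypothetical wrapper: strip tags and leading blanks from every .php file
# below a directory (default: the current one). Each original is saved
# alongside the edited file with a .bak suffix.
clean_php_tree() {
  find "${1:-.}" -type f -name '*.php' \
    -exec sed -i.bak 's/<[^>]*>//g; s/^[[:blank:]]*//' {} \;
}

# Example (commented out because it edits files in place):
# clean_php_tree ./site
```

If the result is not what you wanted, each page.php can be restored from its page.php.bak neighbor.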

Voilà! With a little elbow grease, we were able to transform reams of text into a format suitable for export elsewhere.

Cheers.