Seek and Destroy
Sed--the stream editor--can read, then modify text on the fly. While the syntax for sed appears cryptic at first blush, it is a versatile text-processing tool worth adding to your UNIX utility belt.
In the real world, one could combine sed with the find utility to recursively locate the .php files of a website, then remove from them: 1) all HTML elements, and 2) all leading whitespace.
Here's one approach to that task:
#!/usr/bin/env bash
find ./ -type f -name '*.php' \
-exec sed -i 's/<[^>]*>//g; s/^[ \t]*//g ' {} \;
Let's take a look at the find portion of the snippet first. For reference, the syntax for this command is:
find [location to search] [tests: what to find] [actions: what to do with each match]
In our example, we tell find to begin from the present working directory ./, and to match only regular files (-type f) whose names end with the .php extension (-name '*.php'). We then instruct find to run sed (-exec) on each match, passing it two (2) sed commands.
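Before letting find modify anything, it can help to check which files it would select. Here is a minimal sketch of such a dry run: it swaps -exec for -print, so files are listed but never touched. The directory layout and file names below are made up for illustration.

```shell
#!/usr/bin/env bash
# Build a throwaway directory tree (hypothetical file names) to search.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/site/includes"
touch "$demo_dir/site/index.php" \
      "$demo_dir/site/includes/header.php" \
      "$demo_dir/site/style.css"

# Same tests as the command above, but with -print in place of -exec,
# so matching files are only listed, never edited.
matches=$(cd "$demo_dir" && find ./ -type f -name '*.php' -print | sort)
echo "$matches"

# Remove the scratch directory again.
rm -rf "$demo_dir"
```

Only the two .php files should be listed; style.css fails the -name test and is skipped.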
Ouch! Those sed commands look evil. We'll break them down, but before we do, pay attention to the syntax for the command:
sed Options... [script] [inputfile]
and note the following:
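-i = edit the file in place
s = substitute
g = global (replace every match on a line, not just the first)
// = empty replacement (the matched text is deleted)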
Our first expression:
s/<[^>]*>//g
- looks for an opening angle bracket <
- which is followed by zero or more * characters
- which are NOT ^ a closing angle bracket >
- then looks for a closing angle bracket >
replacing matches with nothing.
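To see the first expression at work without touching any files, you can pipe a single line through sed. This is only a sketch; the sample strings and variable names are invented for the demonstration.

```shell
#!/usr/bin/env bash
# A made-up line of HTML to run through the first expression.
html_line='<p>Hello, <strong>world</strong>!</p>'
no_tags=$(printf '%s\n' "$html_line" | sed 's/<[^>]*>//g')
echo "$no_tags"   # prints: Hello, world!

# Caution: a one-line PHP block also starts with < and ends with >,
# so the same expression removes it entirely.
php_line='<?php echo "hi"; ?>'
no_php=$(printf '%s\n' "$php_line" | sed 's/<[^>]*>//g')
echo "[$no_php]"  # prints: []
```

Because sed works one line at a time, a tag that spans several lines is not matched by this expression.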
Our second expression:
s/^[ \t]*//g
- anchors the match to the start of the line ^
- then matches zero or more * spaces or tabs [ \t] (GNU sed reads \t as a tab)
replacing matches with nothing.
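The second expression can be tried the same way. The sketch below assumes GNU sed, which treats \t inside the brackets as a tab; other sed implementations may read it as a literal backslash and the letter t. For those, the POSIX class [[:blank:]] (space or tab) is a swapped-in alternative, shown on the second call. The sample line is invented.

```shell
#!/usr/bin/env bash
# A made-up line indented with a space, a tab, and two more spaces.
indented=$(printf ' \t  echo "hi";')

# The expression from the article (assumes GNU sed for \t).
stripped=$(printf '%s\n' "$indented" | sed 's/^[ \t]*//g')
echo "$stripped"

# Portable alternative: [[:blank:]] matches a space or a tab.
stripped_posix=$(printf '%s\n' "$indented" | sed 's/^[[:blank:]]*//')
echo "$stripped_posix"
```

Since ^ can only match once per line, the g flag makes no difference here, which is why the second call leaves it off.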
Find then brings it home: for each file it locates, it substitutes that file's path for {} and runs our sed commands on it. The escaped semi-colon \; marks the end of the -exec command.
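Since -i overwrites files, one cautious way to try the whole command is on a scratch copy, with a suffix attached to -i so that sed keeps a backup of each original. The sketch below assumes GNU sed; the scratch directory, file name, and file contents are made up.

```shell
#!/usr/bin/env bash
# Create a scratch directory with one hypothetical .php file.
work_dir=$(mktemp -d)
printf '  <h1>Title</h1>\n\t<p>Body</p>\n' > "$work_dir/page.php"

# Same command as in the article, pointed at the scratch directory.
# -i.bak edits in place but saves the original as page.php.bak.
find "$work_dir" -type f -name '*.php' \
  -exec sed -i.bak 's/<[^>]*>//g; s/^[ \t]*//g' {} \;

result=$(cat "$work_dir/page.php")
backup=$(cat "$work_dir/page.php.bak")
echo "$result"

# Remove the scratch directory again.
rm -rf "$work_dir"
```

If the result looks wrong, the untouched original is still sitting next to it as page.php.bak.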
Voilà! With a little elbow grease, we were able to transform reams of text into a format suitable for export elsewhere.
Cheers.