Seek and Destroy
Sed, the stream editor, can read and then modify text on the fly. While the syntax for sed appears cryptic at first blush, it is a versatile text-processing tool worth adding to your UNIX utility belt.
In the real world, one could use sed in combination with the find utility to recursively locate every .php file of a website, then remove from each one: 1) all HTML elements, and 2) all leading whitespace.
Here's one approach to that task:
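#!/usr/bin/env bash
# Strip HTML tags and leading whitespace from every .php file below ./
# Note: -i without a suffix and \t inside brackets assume GNU sed.
find ./ -type f -name '*.php' \
  -exec sed -i 's/<[^>]*>//g; s/^[ \t]*//g' {} \;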
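Let's take a look at the find portion of the snippet first. For reference, the syntax for this command is:
find [location to search...] [expression]
where the expression combines tests (what to find, such as -type and -name) with actions (what to do with each match, such as -exec).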
In our example, we tell find to begin from the present working directory ./, and search only for regular files (-type f) whose names end with the .php extension (-name '*.php'). We then instruct find to execute (-exec) sed with a script made up of two (2) commands.
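Before letting sed loose, it can help to see which files find would hand over. The following is a minimal sketch, assuming a throwaway directory tree whose file names (index.php, header.php, style.css) are made up for illustration. It runs the same tests as our command but prints the matches instead of editing them:

```shell
#!/usr/bin/env bash
# Build a small scratch tree; every name below is hypothetical.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/site/includes"
touch "$demo_dir/site/index.php" \
      "$demo_dir/site/includes/header.php" \
      "$demo_dir/site/style.css"

# Same starting point and tests as the article's command, minus -exec:
# with no action given, find simply prints each matching path.
matches=$(cd "$demo_dir" && find ./ -type f -name '*.php' | sort)
printf '%s\n' "$matches"
```

Only the two .php files should be listed; style.css fails the -name test and is skipped.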
Ouch! Those sed commands look evil. We'll break them down, but before we do, pay attention to the syntax for the command:
sed [option(s)] [script] [inputfile(s)]
and note the following:
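-i = in-place edit (change the file itself rather than print the result)
s = substitute
g = global (act on every match in a line, not just the first)
// = an empty replacement, which deletes whatever was matched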
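To see those pieces in action, here is a small sketch using a made-up sample string. It compares a plain substitution, the same substitution with the g flag, and an empty replacement:

```shell
#!/usr/bin/env bash
line='2023-01-15'   # hypothetical sample text

# Without g, only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/-/+/')
# With g, every match on the line is replaced.
every_match=$(printf '%s\n' "$line" | sed 's/-/+/g')
# An empty replacement (//) deletes each match.
deleted=$(printf '%s\n' "$line" | sed 's/-//g')

printf '%s\n' "$first_only"    # prints: 2023+01-15
printf '%s\n' "$every_match"   # prints: 2023+01+15
printf '%s\n' "$deleted"       # prints: 20230115
```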
Our first expression:
s/<[^>]*>//g
- looks for an opening angle bracket <
- followed by zero or more (*) characters that are NOT (^) a closing angle bracket >
- then looks for the closing angle bracket >
replacing every match with nothing.
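A quick way to check that reading is to feed the expression a single line. This sketch assumes a made-up snippet of HTML; no files are touched:

```shell
#!/usr/bin/env bash
sample='<p>Hello, <strong>world</strong>!</p>'   # hypothetical input line

# Each <...> run is matched and replaced with nothing; the text between tags stays.
stripped=$(printf '%s\n' "$sample" | sed 's/<[^>]*>//g')
printf '%s\n' "$stripped"   # prints: Hello, world!
```

Keep in mind that sed works one line at a time, so a tag that opens on one line and closes on the next is not matched by this expression.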
Our second expression:
s/^[ \t]*//g
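- looks at the start ^ of each line
- for zero or more (*) spaces or tabs [ \t]
replacing matches with nothing. (GNU sed reads \t as a tab; other sed implementations may not. The g flag is harmless but redundant here, since ^ can match only once per line.)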
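The same kind of spot check works for the whitespace expression. In this sketch the sample line is made up, and the second command swaps in the POSIX class [[:blank:]] (spaces and tabs) as one possible spelling for sed versions that do not read \t as a tab. That swap is an assumption for portability, not part of the original command:

```shell
#!/usr/bin/env bash
# A line indented with a space, a tab, and another space (hypothetical sample).
indented=$(printf ' \t echo "hi";')

# The article's expression; relies on GNU sed treating \t as a tab.
trimmed_gnu=$(printf '%s\n' "$indented" | sed 's/^[ \t]*//')

# Alternative spelling using a POSIX character class for spaces and tabs.
trimmed_posix=$(printf '%s\n' "$indented" | sed 's/^[[:blank:]]*//')

printf '[%s]\n' "$trimmed_posix"   # prints: [echo "hi";]
```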
Find then brings it home: it substitutes the path of each file it locates for the {} placeholder, runs sed on that file, and marks the end of the -exec command with the escaped semi-colon \;.
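For a trial run that leaves the real site alone, the whole pipeline can be pointed at a scratch directory. The sketch below makes a few assumptions: the file page.php and its contents are invented, -i.bak is used so that sed keeps a backup copy of each file it edits, and [[:blank:]] stands in for [ \t] so the example does not depend on GNU sed:

```shell
#!/usr/bin/env bash
# Scratch directory with one made-up PHP file containing tags and indentation.
work_dir=$(mktemp -d)
printf '\t<h1>Title</h1>\n  <p>Body text</p>\n' > "$work_dir/page.php"

# find replaces {} with each path it locates; \; ends the -exec command.
# -i.bak edits in place and saves the original as page.php.bak.
find "$work_dir" -type f -name '*.php' \
  -exec sed -i.bak 's/<[^>]*>//g; s/^[[:blank:]]*//' {} \;

cleaned=$(cat "$work_dir/page.php")
printf '%s\n' "$cleaned"
# prints:
# Title
# Body text
```

The untouched original remains available as page.php.bak should the result need to be rolled back.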
Voilà! With a little elbow grease, we were able to transform reams of text into a format suitable for export elsewhere.
Cheers.