sed: ignore html in search and replace?
Posted by Ian Gil on November 28th, 2003


I have to search and replace text in an html file without corrupting
the html, can this be done with sed, or anything else?


Ian

Posted by Pascal Bourguignon on November 28th, 2003


Ian Gil <i@NOSPAMALLOWED.com> writes:

Yes or yes.


--
__Pascal_Bourguignon__ http://www.informatimago.com/
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~
Living free in Alaska or in Siberia, a grizzli's life expectancy is 35 years,
but no more than 8 years in captivity. http://www.theadvocates.org/

Posted by Alan Connor on November 29th, 2003


On Fri, 28 Nov 2003 23:31:56 GMT, Ian Gil <i@NOSPAMALLOWED.com> wrote:
Yes and yes. Sed is a great tool. Post a short example of what you want
to do and I or someone else will show you how it's done.

The basic search and replace script for sed is:

sed 's/search/replace/g' inputfile > outputfile && mv outputfile inputfile

Ed, which sed is derived from, is also very effective in scripts or
interactively, modifies the original file and is a lot less sensitive
to special characters.

Open the file in ed and type

,s/search/replace/g
w
q

And it's done. In a script it can be done like so:

#!/bin/sh

ed file <<PPP #arbitrary choice of string that won't occur in file
,s/search/replace/g
w
q
PPP

exit 0

I'm using ed right now, as my Usenet pager and editor.

(it lets me read the posts of possible trolls one line at a time,
then if I encounter any abuse I'm outta there, or delete the rest
unseen with +1,$d and reply once --- they hate it :-)


AC


Posted by Ian Gil on November 29th, 2003


Yes, that's all I want to do, a simple search and replace, but it will be done
globally and there'll be many of them to do, and since html has html tags and
script areas in it I need to have them ignored.

For example, I might have all instances of 'title' replaced by 'heading'
but if <title> is replaced by <heading> it will corrupt the html.


Ian

Posted by Alan Connor on November 29th, 2003


On Sat, 29 Nov 2003 01:38:47 GMT, Ian Gil <i@NOSPAMALLOWED.com> wrote:
That isn't making any sense. Please post an actual example of exactly what
you want to do.

No way to deal with this in the general terms that you seem to think it
can be done. Computers are dumb as bricks and need to be told exactly
what to do.

Not going to ask again.

AC

Posted by Ian Gil on November 29th, 2003


I need this command:

sed 's/title/heading/g' file.html

to work normally just as you see it there EXCEPT within html tags.

So I don't want "<title>" replaced by "<heading>" because "<title>" is an
html tag, and I don't want any html tags changed.



Ian


Posted by Alan Connor on November 29th, 2003


On Sat, 29 Nov 2003 02:54:50 GMT, Ian Gil <i@NOSPAMALLOWED.com> wrote:
[ please bottompost ]



PLEASE pay attention! Post an ACTUAL example of the source code you want to
edit, then the same source code after it has been edited.

Or go away.

AC


Posted by Ed Murphy on November 29th, 2003


On Sat, 29 Nov 2003 03:58:55 +0000, Alan Connor wrote:

Are you really so daft that this isn't clear enough for you? Heck, the
original post was plenty clear for me.

Okay, okay, here's your baby food:

----- begin old.html -----
<html> <head> <title> This is the title of the document </title>
<body> This is the body of the document </body> </html>
------ end old.html ------

----- begin new.html -----
<html> <head> <title> This is the heading of the document </title>
<body> This is the body of the document </body> </html>
------ end new.html ------


Posted by Dan Espen on November 29th, 2003


Ian Gil <i@NOSPAMALLOWED.com> writes:

One approach:

's/title>/saveme>/g'
's/title/heading/g'
's/saveme>/title>/g'

It might be easier to fire up emacs, or some other editor
and use an interactive search/replace.
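
The three substitutions above can be chained in a single sed invocation. What follows is only a minimal sketch: it assumes the tag to protect is a bare <title> or </title> with no attributes, and that the placeholder "saveme" never occurs in the input. The function name protect_and_replace is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: replace "title" with "heading" everywhere except in
# <title> and </title>. Assumes the placeholder "saveme" is not in the input.
protect_and_replace() {
    sed -e 's/title>/saveme>/g' \
        -e 's/title/heading/g' \
        -e 's/saveme>/title>/g'
}

# Example run on a single line of HTML.
printf '%s\n' '<title>The title of the document</title>' | protect_and_replace
```

On that sample line the tags survive and only the text between them changes. A tag written with attributes, such as <title lang="en">, would not be protected by this pattern.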

Posted by Pascal Bourguignon on November 29th, 2003


Ian Gil <i@NOSPAMALLOWED.com> writes:

Don't be silly, any occurrence of "entitled" would be transformed to
"enheadingd" then! When you want to run substitutions you always
include some context:

sed -e '/\(prefix context\)\(old\)\(suffix context\)/\1new\3/g'
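
As a rough illustration of substituting with context, the sketch below uses a single space as both the prefix and the suffix context. That choice, and the function name replace_with_context, are assumptions made for the example; real text would need a richer context (punctuation, line start and end).

```shell
#!/bin/sh
# Illustrative only: require a space on each side of "title", so that
# "entitled" and "<title>" are left alone. Occurrences at the edge of a
# line or next to punctuation are NOT matched by this simple context.
replace_with_context() {
    sed -e 's/\( \)title\( \)/\1heading\2/g'
}

# Example run: only the free-standing word is changed.
printf '%s\n' '<title>An entitled title here</title>' | replace_with_context
```

Because each match consumes the spaces around it, two occurrences separated by a single space ("title title") would only have the first one replaced.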



Posted by Alan Connor on November 29th, 2003


On 29 Nov 2003 05:40:48 +0100, Pascal Bourguignon <spam@thalassa.informatimago.com> wrote:
Forgot the s :-)

Actually, with the best-case-scenario, these would do the job:

's/abc /def /g'
's/ abc/ def/g'

sed -e 's/abc /def /g' -e 's/ abc/ def/g' inputfile > outputfile
mv outputfile inputfile

But that would be extremely lucky. And if he has a lot of strings needing
replacing, he'll want to feed them to sed from a file with -f, rather than
writing all those sed scripts on the command line.

It's possible that he could isolate the sections without the tags and just
work on them:

sed -n '/RE/,/RE/{s/abc/def/g;p;}' inputfile etc...
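
A small sketch of the address-range idea, written without -n so that lines outside the range pass through unchanged. It assumes the region to edit is delimited by lines containing <body> and </body>; those patterns and the function name replace_in_body are chosen only for the example.

```shell
#!/bin/sh
# Illustrative: apply the substitution only from the line matching <body>
# through the line matching </body>. Everything else is printed as is.
replace_in_body() {
    sed -e '/<body>/,/<\/body>/s/title/heading/g'
}

# Example run: the <title> tag in the head is outside the range.
printf '%s\n' \
    '<head><title>A title</title></head>' \
    '<body>' \
    'The title page' \
    '</body>' | replace_in_body
```

Tags that sit inside the range are still subject to the substitution, so this only helps when the region being edited contains no tags that match the search string.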


AC


Posted by P.T. Breuer on November 29th, 2003


Ed Murphy <emurphy42@socal.rr.com> wrote:

Then he wants to go for

sed 's/\([^<]\)title/\1heading/g' file.html
sed 's/title\([^>]\)/heading\1/g' file.html
sed 's/title$/heading/g' file.html
sed 's/^title/heading/g' file.html

Peter

Posted by Chris F.A. Johnson on November 29th, 2003


On Sat, 29 Nov 2003 at 01:38 GMT, Ian Gil wrote:
This is not a problem that can easily be solved in a script. It
is analogous to removing comments from a C source file. There are
script solutions, but they are hard to understand, and
consequently not easy to modify when your needs change. Or else
they are very slow.

My approach would be to put the tags on lines of their own, since
whitespace is ignored when rendering HTML, and pipe the result
through an awk script which would print lines containing tags
unchanged, and perform substitutions on the other lines.

I would write a program in C, to first convert all line endings
to spaces, then insert newlines before "<" and after ">". The
output would be piped to awk:

htmlfilter < file.html |
awk '/^</ { ## print HTML tag and proceed to the next line
        print
        next
     }

     NF { ## perform substitutions on non-blank lines
        gsub("left","right")
        gsub("title","heading")
        gsub("bold","emphasized")
        ## add substitutions here

        ## then print the line
        print
     }' > newfile.html
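
To try the pipeline without compiling anything, a first awk stage can stand in for the tag-splitting filter. This is a sketch under assumptions, not the filter described in this post: it does not join input lines, so it assumes no tag spans more than one line, and the name replace_outside_tags is made up for the example.

```shell
#!/bin/sh
# Sketch: stage 1 puts every tag on a line of its own (a stand-in for the
# htmlfilter program); stage 2 substitutes only on lines that do not
# start with "<" and drops blank lines.
replace_outside_tags() {
    awk '{ gsub(/</, "\n<"); gsub(/>/, ">\n"); print }' |
    awk '/^</ { print; next }     # tag line: print unchanged
         NF   { gsub("title", "heading"); print }'
}

# Example run on one line of HTML.
printf '%s\n' '<title>The title</title>' | replace_outside_tags
```

The output has each tag and each run of text on its own line, which browsers render the same way outside of <pre> blocks.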


A basic filter is quite simple:

/* htmlfilter.c */
#include <stdio.h>

int main (void)
{
    int c;
    int n = 0;

    while ( (c = getchar()) != EOF )
    {
        if ( c == '\n' || c == '\r' ) c = ' ';
        if ( c == '<' )
        {
            printf("\n");
            putchar(c);
            n = 0;
        }
        else if ( c == '>' )
        {
            puts( ">" );
            n = 0;
        }
        else
        {
            putchar(c);
        }
        ++n;
    }
    putchar('\n');
    return 0;
}
/* EOF */

This will do for many HTML files, but some elements of a file
should not be reformatted.

Dealing with <pre> blocks makes the program more complicated, and
shows the way to handling any other parts that need special
attention. This version doesn't put tags on a separate line if
they occur within a <pre> block.


/* htmlfilter2.c */
#include <stdio.h>
#include <string.h>   /* for strcmp() */

int main (void)
{
    int c;
    int n = 0;
    int pre = 0;
    char string[256];   /* holds the start of the current tag */

    while ( (c = getchar()) != EOF )
    {
        /* convert \n and \r to spaces if not in a <pre> block */
        if ( pre == 0 && (c == '\n' || c == '\r') ) c = ' ';

        /* restart tag detection at every '<'; pre is left alone so
           that a tag inside a <pre> block does not end the block */
        if ( c == '<' ) n = 0;

        /* record the character, guarding against buffer overflow */
        if ( n < (int) sizeof string - 1 )
        {
            string[n] = c;
            string[n+1] = 0;
        }

        if ( pre == 1 )
        {
            putchar(c);
            if ( !strcmp(string,"</pre") )
            {
                pre = 0;
                n = 0;
                string[n] = '\0';
            }
        }
        else if ( c == '<' )
        /* put a linefeed before the beginning of HTML tag */
        {
            printf("\n%c", c);
            n = 0;
            string[n] = c;
            string[n+1] = 0;
        }
        else if ( c == '>' )
        /* put a linefeed after the end of HTML tag */
        {
            puts( ">" );
            n = 0;
        }
        else if ( c == ' ' )
        {
            if ( n > 74 )
            /* if line is longer than 74 characters, break at a space */
            {
                putchar('\n');
                n = 0;
            }
            else if ( n > 0 )
            {
                putchar(c);
            }
        }
        else
        {
            putchar(c);
        }

        ++n;
        if ( !strcmp(string,"<pre") )
        {
            pre = 1;
            n = 0;
            string[n] = '\0';
        }
    }
    putchar('\n');
    return 0;
}
/* EOF */

The same thing can be written in a bash script, but it is very
slow, especially on large files:


#!/bin/bash

print() {
    if [ -n "$line" ]
    then
        if [ $pre -eq 1 ]
        then
            printf "%s\n" "${line%["$NL$CR"]}"
        else
            case $line in
                '<'*) printf "%s\n" "${line%["$NL$CR"]}" ;;
                *)  ####### substitutions go here #######
                    line=${line//right/left}
                    line=${line//bold/emphasized}
                    line=${line//title/heading}
                    #######################################
                    printf "%s\n" "${line%["$NL$CR"]}"
                    ;;
            esac
            line=
        fi
    fi
}

NL=$'\n'
CR=$'\r'
line=
pre=0
n=0

while IFS= read -d '' -rn1 c
do
    case $line in
        "<pre") pre=1 ;;
        "</pre") pre=0 ;;
    esac
    if [ $pre -eq 1 ]
    then
        line=$line$c
        case $c in
            "$NL")
                print
                line= ;;
        esac
    else
        line=${line# }
        case $c in
            "$NL"|"$CR") line="${line% } " ;;
            '<') print
                 line=$c
                 ;;
            '>') line=$line$c
                 print
                 line=
                 ;;
            ' ')
                 if [ ${#line} -gt 74 ]
                 then
                     print
                     line=
                 else
                     [ -n "$line" ] && line="${line% } "
                 fi
                 ;;
            *) line="${line}${c}" ;;
        esac
    fi
done
print
## EOF ##


These have not been thoroughly tested, but they may provide a
base on which to work.

--
Chris F.A. Johnson http://cfaj.freeshell.org
================================================== =================
My code (if any) in this post is copyright 2003, Chris F.A. Johnson
and may be copied under the terms of the GNU General Public License

Posted by Ian Gil on November 29th, 2003


I figured things might start to get hairy...and you're just the barber I
was looking for - thanks!

Just one thing about the awk part, I'd like to only replace complete
words; so in the other scripts I was able to sandwich the words between
the regex \b to match only complete words but I wasn't able to do that
in this awk line:

gsub("title","heading")


[..in other posts..]

This appeared to work
If so, I wasn't able to make it work. I'd used "<" and ">" in place of
"prefix context" and "suffix context".


Thanks,

Ian

Posted by Anonymous on November 29th, 2003


"IG" == Ian Gil <i@NOSPAMALLOWED.com>:
IG> Yes, that's all I want to do a, a simple search and replace but it will be done
IG> globally and there'll be many of them to do, and since html has html tags and
IG> script areas in it I need to have them ignored.
IG>
IG> For example, I might have all instances of 'title' replaced by 'heading'
IG> but if <title> is replaced by <heading> it will corrupt the html.

The HTML tags can be lowercase, uppercase, or mixed case. 'sed' is not
very well equipped to handle changes while ignoring case. Use 'awk' or
'perl' instead.

-=-
This message was posted via two or more anonymous remailing services.




Posted by Anonymous on November 29th, 2003


"IG" == Ian Gil <i@NOSPAMALLOWED.com>:
IG> Just one thing about the awk part, I'd like to only replace complete
IG> words; so in the other scripts I was able to sandwich the words between
IG> the regex \b to match only complete words but I wasn't able to do that
IG> in this awk line:
IG> gsub("title","heading")
IG> If so I wasn't able make it work. I'd used "<" and ">" in place of
IG> "prefix context" and "suffix context".

You should have used '\\<' and '\\>' instead (these word-boundary
operators are a GNU awk extension). E.g.:

$ seq 20|awk '{gsub("\\<2","#");print $0}'
1
#
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
#0





Posted by Chris F.A. Johnson on November 29th, 2003


On Sat, 29 Nov 2003 at 12:24 GMT, Ian Gil wrote:
gsub(/\<title\>/,"heading")
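
The \< and \> operators are GNU awk extensions and are not part of POSIX awk. For other awks, here is a rough sketch of whole-word replacement using only match() and substr(). It assumes the search word is a plain, non-empty word, since it is used as a regular expression, and it treats letters, digits and underscore as word characters. The function name replace_word is hypothetical.

```shell
#!/bin/sh
# Sketch for awks without \< and \>: replace "old" only where it is not
# adjacent to a letter, digit, or underscore.
replace_word() {
    awk -v old="$1" -v new="$2" '
    BEGIN { re = "(^|[^A-Za-z0-9_])" old "([^A-Za-z0-9_]|$)" }
    {
        out = ""; rest = $0
        while (match(rest, re)) {
            # lead is 1 when the match begins with a delimiter character
            lead = (substr(rest, RSTART, length(old)) == old) ? 0 : 1
            out = out substr(rest, 1, RSTART - 1 + lead) new
            # keep the trailing delimiter in rest so it can start the next match
            rest = substr(rest, RSTART + lead + length(old))
        }
        print out rest
    }'
}

# Example run: "entitled" and "subtitle" are left alone.
printf '%s\n' 'entitled title, subtitle title' | replace_word title heading
```

In the tag-splitting pipeline this would take the place of the plain gsub("title","heading") call on the non-tag lines.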



Posted by Alan Connor on November 29th, 2003


On Sat, 29 Nov 2003 12:24:14 GMT, Ian Gil <i@NOSPAMALLOWED.com> wrote:
Killfiled for 90 days

People: If he won't post a sample of the HTML source he is working with,
then he is up to no good.

You are probably aiding and abetting a script-kiddie.

AC


Posted by Alan Connor on November 29th, 2003


On Sat, 29 Nov 2003 19:32:20 GMT, Alan Connor <zzzzzz@xxx.yyy> wrote:
Got it: This person is probably stealing the HTML source that someone
worked their butt off to create.

THAT'S why he won't post it.

AC


Posted by Ian Gil on November 29th, 2003


You're a strange bloke AC, but I don't like to hear you
cry so here you are; it is not any one particular html, if that
were the case I'd simply edit the source with an editor. The source
will be German web pages, in which key words will be translated and
highlighted on the fly, for my own viewing, in my browser, via Lua extensions
which allow that sort of thing - pretty cool eh - it basically allows
you to manipulate your web browsing on the fly. So do a search for
'das baby' at google.de (pun intended) and digest that html as the
baby food you so require.

