- desktop search engine to search several .pdf files at once?
- Posted by Achim Nolcken Lohse on February 16th, 2004
I need a desktop search engine that will run under Win98SE and can do
word searches (preferably boolean) on several .pdf files at once.
I've got a 600MB CD with 45 pdf files in a dozen nested folders that
I need to search, and am looking for a more efficient method than
opening each file separately and searching it.
Is there any freeware that can do this? Ideally, it should have the
ability to produce a text file with the locations of the hits, and
preferably, some of the context.
Achim
axethetax
- Posted by Sietse Fliege on February 17th, 2004
Achim Nolcken Lohse wrote:
InfoRapid Search & Replace
http://www.inforapid.com/html/searchreplace.htm
http://www.inforapid.org/sr/sr.exe
See also these quotes from their pages :
Full text search in Html, Rtf, Pdf, WinWord and Excel files and
many other file formats
Before you can search PDF files with InfoRapid, you first have to
copy the freeware program pdftotext.exe, which is part of the XPDF
package, into the installation directory of InfoRapid Search & Replace.
http://www.inforapid.org/se/xpdf.zip to download a zip archive with the
program pdftotext.exe.
pdftotext.exe can convert only such PDF documents which are NOT
protected by a password. Password protected PDF files are therefore
searched and displayed as binary data. Many old PDF documents contain
LZW compressed text and pictures, which must be decompressed before the
search. In order to do so, pdftotext.exe needs the freeware program
gzip.exe, which must be also copied into the installation directory of
InfoRapid Search & Replace. gzip is available here: www.gzip.org
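
The conversion step described above can also be scripted directly. What follows is only a minimal sketch and is not part of InfoRapid. It assumes that pdftotext from the Xpdf package is on the PATH, walks a folder tree, converts each PDF to text, and writes every hit with some context to a text file. The function names (find_hits, pdf_to_text, search_tree), the 40-character context window, and the report layout are all illustrative assumptions.

```python
import os
import re
import subprocess


def find_hits(text, words, context=40):
    """Return (offset, snippet) pairs for the first word, but only if
    every word in `words` occurs in the text (a simple boolean AND)."""
    patterns = [re.compile(re.escape(w), re.IGNORECASE) for w in words]
    if not patterns or not all(p.search(text) for p in patterns):
        return []
    hits = []
    for match in patterns[0].finditer(text):
        lo = max(0, match.start() - context)
        hi = min(len(text), match.end() + context)
        # Collapse line breaks so each hit fits on one report line.
        hits.append((match.start(), " ".join(text[lo:hi].split())))
    return hits


def pdf_to_text(pdf_path):
    """Run pdftotext and return the extracted text ('' on failure).
    Assumes a pdftotext binary is installed and on the PATH."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
    )
    if result.returncode != 0:
        return ""
    return result.stdout.decode("utf-8", errors="replace")


def search_tree(root, words, report_path):
    """Search every .pdf below root and write the hits to report_path."""
    total = 0
    with open(report_path, "w", encoding="utf-8") as report:
        for folder, _dirs, files in os.walk(root):
            for name in sorted(files):
                if not name.lower().endswith(".pdf"):
                    continue
                path = os.path.join(folder, name)
                for offset, snippet in find_hits(pdf_to_text(path), words):
                    report.write(f"{path} @ {offset}: {snippet}\n")
                    total += 1
    return total
```

A call such as search_tree("D:\\", ["hunting"], "hits.txt") (hypothetical paths) would then produce one line per hit. Files that pdftotext cannot convert simply yield no hits in this sketch.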
--
Cheers,
Sietse Fliege
- Posted by Son Of Spy on February 17th, 2004
Sietse Fliege wrote:
Another Option:
Search PDF
SearchPDF will scan a selected directory branch for PDF files and search
each PDF for a given text string. It displays all matching and non-matching
PDF files found in two list boxes. You can then open any selected matching
PDF file from within the program. Like MakePDF, it uses AFPL Ghostscript to
do all the hard work. In order to use the program, you must have
Ghostscript installed. This program also uses the PSTOTXT package, written
by Paul McJones and Andrew Birrell of Digital Equipment Corporation's
Systems Research Center (see the file pstotext.txt which is included in the
download zip file for licencing details).
http://www.lexacorp.com.pg/soft/searchpdf11.zip ~90Kb
Download GhostScript:
http://unc.dl.sourceforge.net/source...t/gs813w32.exe ~7.8 Mb
Cheers!
SOS
--
Some You Won't Find Anywhere Else...
http://www.sover.net/~wysiwygx/index.html
- Posted by Achim Nolcken Lohse on February 17th, 2004
On Mon, 16 Feb 2004 07:46:46 GMT, lohsea@3web.nettax (Achim Nolcken
Lohse) wrote:
I will probably try the InfoRapid approach first, as it seems a bit
simpler. Will let you know how it goes.
Achim
axethetax
- Posted by Achim Nolcken Lohse on February 17th, 2004
On Tue, 17 Feb 2004 02:16:58 +0100, "Sietse Fliege"
<change_.invalid_to_.nl@sf.slownet.invalid> wrote:
install), including putting pdftotext.exe in the serapid directory,
but no joy.
Inforapid recognized the presence of pdftotext (Acrobat Reader shown
in the external viewers drop down menu), and duly searched the pdf
file in the designated directory, but couldn't find a single hit. The
search was the simplest possible - "hunting" in a single pdf file,
case sensitivity was unchecked, and "whole word only" was also
unchecked. The file contains 9 instances of this word, but Inforapid
found none.
Worse - the program is not well-behaved. I initially set it to search
a folder containing 120 pdf files amounting to 64MB of material. The
display screen turned off during the search, and when I tried to
reactivate it by hitting a key on the keyboard, I got no response.
After trying some specific keys, the display came back, but was
shimmering, fuzzy, and showed three desktops side by side. Then the
system reset. I'm guessing the system ran out of memory.
Achim
axethetax
- Posted by Sietse Fliege on February 18th, 2004
Achim Nolcken Lohse wrote:
Yes, the install is quite straightforward.
The only thing might be that you did not mention gzip.exe.
This should also be put in the serapid directory, in case of old PDF
documents containing LZW compressed text/pictures.
Apparently you also have 'Use external converters' checked. 
Bummer. I can't really think of anything. 
I am puzzled. It behaves well on my system, XP, 2G, 256 MB.
You can let it build a cache which makes it fast.
I just did a search for the word 'windows' in 249 pdf's, totaling 231
MB, and the search finished in 5.50 sec.
Perhaps you first searched through pdf's on a CD, which might complicate
things, memory-wise.
Maybe you could try deleting the cache (seCache.tmp), then start a new
search through one pdf file on the hard disk.
You should at least be able to get correct results searching txt files.
The author, Ingo Straub, has occasionally answered a question in this
group. You might want to try and e-mail him.
--
Cheers,
Sietse Fliege
- Posted by Achim Nolcken Lohse on February 18th, 2004
On Wed, 18 Feb 2004 03:28:56 +0100, "Sietse Fliege"
<change_.invalid_to_.nl@sf.slownet.invalid> wrote:
Yes, but not likely the case here. These pdf files were only recently
created by a commercial outfit.
Just to be sure, I installed gzip. The instructions, unfortunately,
are not clear. Gzip wants to install itself in its own directory. I
put the directory in the seRapid folder. The inforapid instructions
say to simply copy the executable into the same folder as inforapid,
but which one? There are several. So for good measure, I then copied
gzip.exe and gunzip.exe from their folder into the seRapid folder.
It made no difference.
Well, my box is in a different category: Pentium 75MHz, 128MB, but I
didn't see any system requirements listed on the site.
Yes, the pdfs were on a CD. So following your suggestion, I searched
them on the hard drive - same result.
Makes no difference either.
Ok. Text file searches worked.
And then I tried searching some other pdf files, and found that
InfoRapid works on some, not on others. So perhaps it depends on the
version of Acrobat used to create the files?
The program shows "memory used" in the progress bar before and after
search sessions. On my system, it starts at about 40%, in yellow, and
then turns red at about 60%. But maybe it's not monitoring while doing
the search, resulting in a system crash?
OK. I'll send him a copy of this post, and offer to send on a couple
of the non-searchable pdf files, if he's interested.
Achim
axethetax
- Posted by Sietse Fliege on February 18th, 2004
Achim Nolcken Lohse wrote:
I copied both as well. I believe only gzip.exe is required, though.
You can also try the latest version (3.0) of pdftotext.exe.
http://www.foolabs.com/xpdf/
ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.00-win32.zip (1141565 bytes)
Does your system keep crashing with it?
--
Regards,
Sietse Fliege
- Posted by Achim Nolcken Lohse on February 18th, 2004
On Wed, 18 Feb 2004 06:01:07 +0100, "Sietse Fliege"
<change_.invalid_to_.nl@sf.slownet.invalid> wrote:
Good idea. The one I got from the inforapid link was V2.02. But it
doesn't seem to make any difference.
Only on the one large search on the CD so far. So after installing
xpdf 3.0, I tried searching on a large chunk of the CD again - about
385MB. The search dragged on for 28 minutes, the progress bar showed
it about 75% complete. It was working on the last folder of pdf files,
all of them maps.
Then the screen went black, and like the previous time, the keyboard
wouldn't bring the display back, but moving the mouse did. Again, the
desktop showed in triplicate, too shimmery and unfocused to be
readable. I managed to get a drop down menu, but couldn't read it. And
then the system reset again.
The only difference from the previous attempt was that I got one hit
and two "sort ofs": on one, the context was garbage characters, and on
the other, only the file was shown, not the target word or its
context.
So no joy.
BTW - a 64MB search on my hard drive only took about a minute by
comparison. But they weren't the same files. I have a hunch the map
files are slowing things down.
Achim
axethetax
- Posted by Sietse Fliege on February 18th, 2004
Achim Nolcken Lohse wrote:
I read in pdftotext.txt in the latest xpdf distribution:
"BUGS
Some PDF files contain fonts whose encodings have been mangled beyond
recognition. There is no way (short of OCR) to extract text from these
files."
I guess that this might account for what happens w.r.t. the maps.
But with 'normal' pdf's results should be correct (provided that any
corrupt cache has been deleted).
Other than that : beats me.
InfoRapid also worked fine on my win95 box, P166, 72 MB.
I hope it'll work out in the end for you.
I like the program (which is also Pricelessware).
--
Cheers,
Sietse Fliege
- Posted by Achim Nolcken Lohse on February 18th, 2004
On Wed, 18 Feb 2004 13:31:29 +0100, "Sietse Fliege"
<change_.invalid_to_.nl@sf.slownet.invalid> wrote:
....
Yes, it accounts also for one of the non-readable pdfs I tested, which
seems to be nothing but a huge image file.
Yes. It's definitely not a cache problem, nor an OCR problem, because
in several of the test pdfs, AR's search engine had no trouble finding
the target words.
It gets worse. I just tried running it on my top machine, an AMD K6-2
500MHz with 384MB of RAM. I copied the whole SERapid folder onto a
Zip, sneakered it over to the AMD, copied it onto the D: drive, and
ran it.
It started up, but only showed the top two fields in the Search
Dialog window. So I was able to change the file type, but not the
directory to search, or the type of search, as these fields didn't
display.
I tried adding the path in front of the file type in the file type
field, but that did nothing. The only directory InfoRapid would search
is C:\!
I went through my files to see if Inforapid required special
installation, but couldn't find any references. I believe it did use
an installer, but didn't seem to do much except copy everything into
its directory under \Program Files, and add an icon to the startup
menu.
In any case, it's not obvious why it would search only C:\ when it's
installed on D:\.
(This is why I have 20 untried freeware programs sitting on my hard
drive for every one that I've actually installed :/)
Achim
axethetax
- Posted by Sietse Fliege on February 19th, 2004
Achim Nolcken Lohse wrote:
As it also ran fine on my Win95 box and no special requirements are
mentioned, I'd assume that there are none.
But your problems put that somewhat in doubt.
It looks like seRapid is written in Visual C++ v7.0.
AFAIK that does not necessarily mean seRapid also depends on e.g.
msvcp70.dll. I might be wrong but it looks like seRapid "only" needs
e.g. msvcp60.dll. I hope the author will help you out, here.
I did not monitor the install, but had a look in its INSTALL.LOG.
It suggests that for a completely proper installation, setup is required
rather than just copying files.
It looks like in the latter case at least SEStart.dll and non-critical
functions, like the context menu and printing search results, might not be
properly registered.
I'm out of my depth and hope that the author will help.
--
Cheers,
Sietse Fliege
- Posted by Achim Nolcken Lohse on February 19th, 2004
On Thu, 19 Feb 2004 15:44:13 +0100, "Sietse Fliege"
<change_.invalid_to_.nl@sf.slownet.invalid> wrote:
Thanks, I should have looked it up myself - I use Quarterdeck
Cleansweep to monitor my installs. I'll try doing a full install and
see if it makes a difference to the way the program runs.
I may have to e-mail him again, because I got a cryptic message from
the ISP that handles my incoming mail, telling me a post of mine had been
rejected due to excessive length. But it gave no hint as to what the
message or who the addressee was, and the date stamp didn't correspond
to anything I have in my logs (I use one ISP for sending mail and
another for receiving, which complicates things a bit). Strangely, I
got no error message from the sending ISP.
regards,
Achim
axethetax
- Posted by Sietse Fliege on February 20th, 2004
Achim Nolcken Lohse wrote:
FWIW: I happen to have e-mailed him about a month ago (about a new
program that he released in German language only.)
Got an answer within two days.
--
Cheers,
Sietse Fliege
- Posted by Susan Bugher on February 20th, 2004
Achim Nolcken Lohse wrote:
FWIW I have a hunch you may be right. My impression is that InfoRapid
S&R is much slower with large files. Replacing a lot of text items in
one moderately large file is a slow process on my machine (Win98 PIII
128 MB RAM). Great program though IMO. 
Susan
--
Pricelessware: http://www.pricelessware.org
alt.comp.freeware FAQ: http://clients.net2000.com.au/~johnf/faq.html
http://www.pricelessware.org/2004/proposal-CD.htm
http://www.pricelessware.org/2004/pr...egoryIndex.php
- Posted by Ingo Straub on February 22nd, 2004
Hello Achim,
Sorry, but I haven't received your e-mail. I think my postbox is
limited to 5 MB. Please don't send me such large e-mails without asking
me first.
Have you tried copying the PDF files to your hard disk and searching them
there? If that helps then it's a problem with Windows 98 and your
CD-ROM driver. I have seen similar problems with other programs. It
occurs when the PC is busy calculating some results and
doesn't access the CD-ROM drive for some time. Then the CD stops
spinning, and the next time the PC tries to access the CD-ROM drive, a
blue screen appears and the system must be rebooted.
If that doesn't help then you can try the following: Make sure that
the options "Use internal converters for HTML and RTF files" and "Use
external converters" on the tab CONVERTERS are checked. On the tab
SEARCH you have to leave the field SEARCH FOR empty. Press the Start
button. Then InfoRapid displays a list of all PDF files in your search
directory. When you click on one of the file names, then InfoRapid
shows the text which was returned by PDFTOTEXT, after it has converted
your PDF file into a text file. Every word which is contained in this
text file can be found by InfoRapid. If your SEARCH WORDS are not
contained in this text, then they will never be found.
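
The same check can be made outside InfoRapid. The following is a rough sketch, assuming pdftotext is on the PATH. The helper names and the 0.8 threshold are arbitrary choices for this example and are not defined by InfoRapid or Xpdf. It extracts the text of one PDF and reports whether anything came back, whether it looks like readable text, and whether the search word is in it. The "garbled" test is only a crude heuristic: text whose font encoding maps to the wrong letters would still pass it.

```python
import subprocess


def readable_ratio(text):
    """Fraction of non-space characters that are letters, digits,
    or common punctuation."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    ok = sum(1 for c in chars if c.isalnum() or c in ".,;:!?'\"()-")
    return ok / len(chars)


def diagnose(text, word, threshold=0.8):
    """Classify the text that pdftotext returned for one PDF."""
    if not text.strip():
        return "no text extracted (image-only or protected PDF?)"
    if readable_ratio(text) < threshold:
        return "text extracted, but it looks garbled"
    if word.lower() in text.lower():
        return "word found in extracted text"
    return "text looks readable, but the word is not in it"


def extract(pdf_path):
    """Return whatever pdftotext writes to stdout for pdf_path.
    Assumes a pdftotext binary is installed and on the PATH."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

For a single file, diagnose(extract("some.pdf"), "hunting") (hypothetical file name) gives a one-line verdict.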
Regarding your second problem: You can resize the search dialog with
the tracker at the top of the search bar. Just track it upwards with
the left mouse button held down and you will find the missing input
fields.
Best Regards
Ingo
- Posted by Achim Nolcken Lohse on February 23rd, 2004
On 22 Feb 2004 12:15:06 -0800, info@inforapid.de (Ingo Straub) wrote:
Sorry Ingo, I didn't realize how large the files were until too late.
I believe there were two at about 1.7MB, and one or two much smaller
ones.
Yes, it makes no difference to the ones that get no hits. Of course
the search is much faster.
.....
This approach displays a screen of garbage characters for the most
part. Some of the pdf files in the batch I'm trying to search can't be
searched because they're simple image files. They couldn't be searched
by AR's internal search mechanism either. However, there are others
which are searchable, but which InfoRapid/pdftotext can't decipher.
Unfortunately, they're mostly very large files of one Megabyte and up.
I'll try to find a smaller one.
this, and managed to get the same single hit I did before. Then I
tried the same search with another word, and the system locked up
again.
Will contact you by e-mail when I have more meaningful information.
Achim
axethetax
- Posted by Ingo Straub on February 24th, 2004
Hello Achim,
The problem is that the files you can't search with InfoRapid are copy
protected. When you generate PDF documents, you can choose whether other
people should be allowed to copy the text or not. PDFTOTEXT respects
this flag and doesn't convert the PDF file into a text file.
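
In a script, one way to tell such files apart is the exit status of pdftotext. The sketch below assumes a pdftotext build that follows the exit codes documented in the Xpdf man page (0 = no error, 1 = error opening the PDF file, 2 = error opening the output file, 3 = error related to PDF permissions, 99 = other error). Whether the older builds mentioned in this thread report the same codes is not confirmed here. The names EXIT_MEANINGS, describe_exit and check_pdf are made up for this example.

```python
import subprocess

# Exit codes as documented in the Xpdf pdftotext man page.
EXIT_MEANINGS = {
    0: "ok",
    1: "could not open the PDF file",
    2: "could not open the output file",
    3: "blocked by PDF permissions (copying text not allowed)",
    99: "other error",
}


def describe_exit(code):
    """Translate a pdftotext exit code into a short message."""
    return EXIT_MEANINGS.get(code, f"unknown exit code {code}")


def check_pdf(pdf_path):
    """Run pdftotext on one file and describe the outcome.
    Assumes a pdftotext binary is installed and on the PATH."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
    )
    return describe_exit(result.returncode)
```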
Best Regards
Ingo