- Searching substrings in records.
- Posted by janesconference@gmail.com on June 27th, 2008
Hi to everybody, my question is simple but I still haven't found a
definitive answer.
Suppose I have a lot of records, each composed of a fixed number of
fields, like a database. Some of these fields are strings (or char
arrays, what you prefer).
I want to extract every record in my record collection whose string
field contain a given substring. Substrings can be matched only within
words, i.e. "foo" matches "you are a fool" but not "babyproof oo
programming languages" (I've come with the two most idiot examples a
human mind could conceive, I know).
The brute force solution would be iterate on every string field of
every record and perform a substring search on any of them.
If at least one field contains the substring, the record must be
returned.
The question: is there a smarter approach than this? Maybe indexing
somehow the records before performing the searches? And are there
decent (c/c++) libraries dealing with this specific problem?
Thanks in advance,
Cristiano.
- Posted by Jens Thoms Toerring on June 27th, 2008
janesconference@gmail.com wrote:
If you want to know if a substring is in one of the strings of
your records you have to go through them once - how would you
know otherwise? Now, if you would search for the same substring
again and again and the records don't change then, of course, it
might make sense to store the result somewhere and use the al-
ready computed result.
I have no clear idea what you mean by "indexing the records be-
fore performing the search". Unless you have some prior infor-
mations about the substrings that will be searched for all that
you could reasonably precompute is the lengths of the strings
in the records and don't search those strings that are shorter
than the string that gets searched for and similar, very general
informatione (like does the strings contain upper or lower case
characters and, if you know the language, if characters with a
very high or low chance to appear in a text are in the string).
But the question you should askyourself is if searching for sub-
strings is theperformance bottle neck in your application before
investing much time to optimize this aspect of it. Did you con-
sider sticking the records into a database and let the database
do the dirty work? That's probably the nearest match you will
get when looking for a library dealing with such problems.
Regards, Jens
--
\ Jens Thoms Toerring ___ jt@toerring.de
\__________________________ http://toerring.de
- Posted by janesconference@gmail.com on June 27th, 2008
Yeah, I did. I think I will use something like BerkeleyDB.
But I wanted to know, for scientifical curiosity, "how it should be
done".
I tought about,for example, extracting at once all the words in all
the string fields, store them along with the record(s) they belong to,
then substring-search on all the words instead of exploring all the
database everytime.
Does it make sense?
- Posted by janesconference@gmail.com on June 27th, 2008
And: yes.
This algorithm is part of an interface that searches through tagged
files, and the feedback should be quasi-instantaneous (there are lots
of apps that provide this feature, and I'd like to mimick this
behaviour).
- Posted by Jens Thoms Toerring on June 27th, 2008
janesconference@gmail.com wrote:
Depends on how much effort it is to find the strings in the records
and if there are many words in the strings that are identical. If
the strings in the database are made up of normal words that often
are repeated then making up a list of words might speed things up
(at the cost of extra memory). On the other hand if there are e.g.
serial numbers stored in the strings in the records (something
like "U34-12-JK-71") and they're all (or at least mostly) different
things change. So I guess a clear and simple answer is basically
impossible unless one knows a lot about the typical contents of the
strings and perhaps also has some reliable statistical data about
what's mostly gets searched for.
Regards, Jens
--
\ Jens Thoms Toerring ___ jt@toerring.de
\__________________________ http://toerring.de
- Posted by Bartc on June 27th, 2008
<janesconference@gmail.com> wrote in message
news:25813813-8e33-4982-8dfe-384929d2b234@t54g2000hsg.googlegroups.com...
This sounds like the sort of thing Google does. And so, yes, it would have
to use some very smart approaches.
I have no idea exactly how Google achieves it's speed. But it's possible to
make some guesses. For one thing, the search term you enter (your substring)
is probably not unique; someone else will already have searched for it!
And possibly each individual word of the search string will already have
been indexed (in a way to immediately know in which pages it's used).
If you only do a single search on your database, a linear search might as
well be done. For many different searches, then there will be better ways
than linear searching.
--
bartc
- Posted by janesconference@gmail.com on June 27th, 2008
You mean it's either the naive approach or the super-smart google
approach? Possible there's no quite-smart algorithmic approach? 
- Posted by Ed Prochak on June 27th, 2008
On Jun 27, 5:54 am, janesconfere...@gmail.com wrote:
That involves parsing all the strings twice: once to extract words and
once to do the substring search on each word.
assuming you merge the text portions of the records into rows of
RECORD ID (record number or whatever uniquely identifies your record)
MERGED TEXT (since words are key for you, merging fields with a space
between each would work)
then just a simple search works:
fetch next text record
scan merged text for foo (or whatever your search pattern is)
if match, log success on the record ID
The scan is implemented as a simple FSM (finite state machine). But
the idea is to read the data a few times as possible. You might even
specialize it even more by scanning for the pattern in the routine
that fetches the raw records, skipping the merge step. so then the
algorithm looks like
fetch next record
for each text field in the record
scan text for foo (or whatever your search pattern is)
if match, log success on the record ID
This could be programmed fairly directly in a PERL script. As is often
the case, the right tool/language for the job can make things much
easier.
This all assumes the search is fairly infrequent, like an ad hoc
search. If it is to be done on a regular basis, then building indices
might be of benefit. (consider for example the Oracle CONTEXT
package).
The solution depends on the real requirements.
HTH
ed
- Posted by janesconference@gmail.com on June 27th, 2008
Excuse me, but I just can't tell the difference between the algorithm
you introduced above and the brute force approach I described on the
first post:
"The brute force solution would be iterate on every string field of
every record and perform a substring search on any of them.
If at least one field contains the substring, the record must be
returned. "
Or in Python, but language doesn't matter, now I'm focused on the ways
to do it.
Yeah, it' an ad hoc search. I've seen commercial programs (like
Mixmeister on its library of mp3 tags, for example) doing this kind of
things so quickly that it implements a search-while-you-type dialog.
- Posted by Bartc on June 27th, 2008
<janesconference@gmail.com> wrote in message
news:ee916d1c-6a9a-46a8-b275-d64e4d02b612@x41g2000hsb.googlegroups.com...
You asked if there was something better than brute force; well, yes!
If you want something in-between, then I don't know a specific method; I
would have to look at your data and the kind of searches you want to make,
and see what can be done to achieve your goal.
You say your substrings are only substrings of complete words. Then you can
scan your data to form a list of all the words it uses; there will only be a
few thousand of these (even if the data fields are quoting Shakespeare).
Then a very quick check of this list will tell you if the substring is
present in the database, and if so it might tell you the first record it's
used in. While the user is looking at that, you can be searching for the
next record.
Or you might choose, for every word, to store a list of the records where
it's used. This could be a big list, but you will get the results very
quickly. (And you have to keep it up-to-date of course, but then your
database might be an archive.)
--
Bartc
- Posted by Bartc on June 27th, 2008
<janesconference@gmail.com> wrote in message
news:0ffe54b4-f15c-4a0b-b9da-12f7afbbf8ef@m44g2000hsc.googlegroups.com...
OK. What is a tagged file exactly? Are you searching the contents of these
tags? Are the tags, whatever they are, part of the file contents requiring
each file to be opened to retrieve them?
And how many files were involved?
Is it possible that the application you saw was cacheing whatever text is
being searched in a separate place?
--
Bartc
- Posted by janesconference@gmail.com on June 27th, 2008
On 27 Giu, 16:28, "Bartc" <b...@freeuk.com> wrote:
No, I guess that the files are read at the start and the tags are
extracted from them and put into some colections of records in memory.
They are mp3 tags, such as title, artist, album, compilation,
comments. Say ten - twelve text field per song.
in my case, about 10K. But a big collection could count up to 500K
files, I'd say.
Umm, I'm not sure of what you mean here. The records containing the
fields being searched are stored in RAM, for sure.
- Posted by Richard Harter on June 27th, 2008
On Fri, 27 Jun 2008 03:04:07 -0700 (PDT),
janesconference@gmail.com wrote:
It sounds as though you want to build a suffix tree. Google is
your friend.
Richard Harter, cri@tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
Save the Earth now!!
It's the only planet with chocolate.
- Posted by Bartc on June 27th, 2008
<janesconference@gmail.com> wrote in message
news:97355c0d-c3b4-4aa2-8ddc-d18c3688981d@l64g2000hse.googlegroups.com...
Well the tag data could have been collected into one file somewhere. And
once in RAM a search is near-enough instant, for 10K records, even with your
brute force method.
The question is how or when the collation from the separate data files takes
place. You would surely notice 10K files being opened and closed! It's
possible this is done in the background, with the first few searches being
slower until the tag data summary is complete.
--
Bartc
- Posted by janesconference@gmail.com on June 27th, 2008
On 27 Giu, 17:16, c...@tiac.net (Richard Harter) wrote:
Suffix trees seem useful.
But I'd have to build a suffix tree for every string field, then for
every search process all the records and inspect every string field's
suffix tree.
If I have a database of 200k files and ~10k words, maybe it would be
better to collect all the words, build a suffix tree for each one of
them and then search on all the words.
- Posted by janesconference@gmail.com on June 27th, 2008
On 27 Giu, 17:24, "Bartc" <b...@freeuk.com> wrote:
That's no great problem: the application check always at the start
what files have changed, and while it rebuilds its database, you can't
search.
However, How long does it take, to open & close (and parse tags of)
10k file, in your estimation?
- Posted by Richard Harter on June 27th, 2008
On Fri, 27 Jun 2008 08:48:27 -0700 (PDT),
janesconference@gmail.com wrote:
Yes. Incidentally, opening and closing files is relatively
expensive, so definitely want your indexing data in a separate,
relatively compact file.
Richard Harter, cri@tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
Save the Earth now!!
It's the only planet with chocolate.
- Posted by Bartc on June 27th, 2008
<janesconference@gmail.com> wrote in message
news:8f493179-cfcb-4904-9b91-750f99cc8bb6@m45g2000hsb.googlegroups.com...
OK, so not quite so instant as you said...
A quick test (on my low-end 4-yo PC) to create 10000 1-line text files took
27 seconds.
To read them back (open, read the one line, and close), took about 1 second.
Although it's possible disk cacheing helped here (it might also help your
appl however).
But I'm sure you can do this sort of test yourself, with the actual data!
Just open and close each of the files, and read a few hundred bytes from the
file. If the tag data requires reads from several places in the file, that
will slow things down a bit, but usually anything that happens in memory is
magnitudes faster than reading/writing from disk.
(If you are using C's fopen(), make sure you use "r"/"rb" mode not "w/wb"!)
--
Bartc
- Posted by Ed Prochak on June 27th, 2008
On Jun 27, 9:10 am, janesconfere...@gmail.com wrote:
An FSM search is not the same as you substring search.
Consider looking for "sea" in the previous sentence. you brute force
solution is to take successive substrings like
"An ", "n F", " FS", "FSM"...
which involves checking each character of the input string N times,
where N is the length of the search string.
This algorithm is order N*M ( where M is the length of the input
string)
An FSM would look at each character of the input string once. So my
algorithm is order M.
Language does matter. Programming this in C would take many more lines
of code. Given the maxim that there are X bugs per line of code, the
PERL or PYTHON programs will be faster to write and faster to debug.
But your point is taken, a poor algorithm in a great language will
still run poorly.
Then again the language matters. How often would you choose C over
PYTHON for an ad hoc program?
HTH,
ed
- Posted by janesconference@gmail.com on June 27th, 2008
That's ok, thanks, justo to have a magnitude order.
Actually, it'll be a library to do that for me, so I expect a slight
overhead. However, it has to be done once, @ the start of the
application, so it's only startup time.
Yeah :-)