Tech Support > Computers & Technology > Programming > a question on google's algorithm
a question on google's algorithm
Posted by %NAME% on February 20th, 2007


The title is not exactly a disnomer, so pleaes let me explain...

Say I have a large number of 0-1 vectors containing only
bits corresponding to 0 or 1. Now given an arbitrary 0-1 vector,
I wish to find all the 0-1 vectors that are subsumed by that query bit
vector.

For example, given bit vector query as 01100, i wish to find all bit
vectors that whose 1st, 4th, 5th bits are also zero, leaving
the 2nd, 3rd bits as either 0 or 1.

What I wish to do is to build an index on the bit vectors, so that I
could avoid checking each 0-1 vector in sequence. In another word, I
hope this index could help me quickly filter away those vectors that
guarunteed not to be subsumed by the given 0-1 vector.

I am not sure this is the right place to ask, but the index could be
used to solve a search engine problem: the 0-1 vector
corresponds to my keywords search (a 0 means that keyword NOT in my
query, while a 1 means the keyword is in), and all the webpages
corresponds to 0-1 vectors. I wish to find all the webpages that
subsumes my query.

If anyone happen to know how Google solves this problem, and it is not
confidential, pls let me know, thanks a bunch!!!!

Posted by user923005 on February 21st, 2007


On Feb 20, 4:56 pm, "%NAME%" <hbccan...@gmail.com> wrote:
How about a bit vector in the form of a quadtree sort of thing (but in
this case an n-tree).

C & C++ groups removed.

If you want to know how the Google search works, use Google. No. I'm
not joking. The people who developed the google search engine wrote a
paper on it. They published it.

Here's the view from 3000 feet:
http://www.google.com/technology/

Details:
http://infolab.stanford.edu/~backrub/google.html
http://infolab.stanford.edu/pub/papers/google.pdf
http://maya.cs.depaul.edu/~classes/e...apers/brin.pdf

This is also interesting:
http://www.cs.uu.nl/docs/vakken/mr/c...r-indexing.ppt


Posted by Alf P. Steinbach on February 21st, 2007


* %NAME%:
No C++ content means OFF-TOPIC in [comp.lang.c++].

Thank you.

FUT: [comp.programming]

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

Posted by Bill Reid on February 21st, 2007



%NAME% <hbccanada@gmail.com> wrote in message
news:1172019379.125896.186810@a75g2000cwd.googlegr oups.com...


---
William Ernest Reid




Posted by mensanator@aol.com on February 21st, 2007


On Feb 20, 6:56 pm, "%NAME%" <hbccan...@gmail.com> wrote:
Suppose I have a database of all the words from the
Official Scrabble Players Dictionary (OSPD). I can
create a consonant/vowel vector for all 5-letter
words using the SQL statement:

SELECT OSPD_words.OSPD,
IIf(Mid$([OSPD],1,1) Like "[a,e,i,o,u]",1,0) AS bit1,
IIf(Mid$([OSPD],2,1) Like "[a,e,i,o,u]",1,0) AS bit2,
IIf(Mid$([OSPD],3,1) Like "[a,e,i,o,u]",1,0) AS bit3,
IIf(Mid$([OSPD],4,1) Like "[a,e,i,o,u]",1,0) AS bit4,
IIf(Mid$([OSPD],5,1) Like "[a,e,i,o,u]",1,0) AS bit5,
[bit1] & [bit2] & [bit3] & [bit4] & [bit5] AS vector
FROM OSPD_words
WHERE (((Len([OSPD]))=5));

which produces:

OSPD bit1 bit2 bit3 bit4 bit5 vector
added 1 0 0 1 0 10010
adder 1 0 0 1 0 10010
addle 1 0 0 0 1 10001
adeem 1 0 1 1 0 10110
adept 1 0 1 0 0 10100
adieu 1 0 1 1 1 10111
adios 1 0 1 1 0 10110
adits 1 0 1 0 0 10100
adman 1 0 0 1 0 10010
admen 1 0 0 1 0 10010
admit 1 0 0 1 0 10010
admix 1 0 0 1 0 10010
adobe 1 0 1 0 1 10101
adobo 1 0 1 0 1 10101
..
..
..

Once I have those vectors, I can then query them to
find anything that has the 1st, 4th & 5th letters
as consonants and the 2nd & 3rd letters can be either
consonants or vowels. This is done using the
like "0??00" pattern matching criteria. The question
marks are single character wild cards:

SELECT vector_test1.OSPD, vector_test1.vector
FROM vector_test1
WHERE (((vector_test1.vector) Like "0??00"));

which filters the above output to show only the
requested vectors:

OSPD vector
baals 01100
backs 01000
baddy 01000
badly 01000
baffs 01000
baffy 01000
baggy 01000
bahts 01000
bails 01100
bairn 01100
baith 01100
baits 01100
..
..
..
brach 00100
bract 00100
brads 00100
brags 00100
braky 00100
brand 00100
..
..
..

And because we didn'y define "y" as a vowel, there
are even some 5-letter words with no vowels:

OSPD vector
ghyll 00000
lymph 00000
hymns 00000
gypsy 00000
crypt 00000
dryly 00000
lynch 00000
synth 00000
glyph 00000
synch 00000


Posted by Ben Pfaff on February 21st, 2007


"user923005" <dcorbit@connx.com> writes:

Don't forget about this one:
http://www.google.com/technology/pigeonrank.html

--
Ben Pfaff
blp@cs.stanford.edu
http://benpfaff.org

Posted by WTF on February 21st, 2007


Next time, not using your send button.



Posted by Jerry Coffin on February 21st, 2007


In article <1172019379.125896.186810@a75g2000cwd.googlegroups .com>,
hbccanada@gmail.com says...
Assuming your bit-vectors were stored in an integer type, you could use
a comparison like this:

if ((bit_vector | pattern) == pattern)
// found
else
// not found

If you need to store larger bit-vectors than will fit in any integer
type (only 32 are guaranteed), you can look at std::bit_vector and/or
std::vector<bool>. IIRC, bit_vector supports boolean operations
directly, but for vector<bool> you'll probably need to write your own.

--
Later,
Jerry.

The universe is a figment of its own imagination.

Posted by CBFalconer on February 21st, 2007


Jerry Coffin wrote:
The fault is basically that of the OP, but please do not post
off-topic information on c.l.c. It is easy to overlook the foolish
cross-posting, but you should not answer in any group where the
answer (and question) are off topic. F'ups set.

--
<http://www.cs.auckland.ac.nz/~pgut001/pubs/vista_cost.txt>
<http://www.securityfocus.com/columnists/423>

"A man who is right every time is not likely to do very much."
-- Francis Crick, co-discover of DNA
"There is nothing more amazing than stupidity in action."
-- Thomas Matthews



Posted by Randy Howard on March 6th, 2007


On Tue, 20 Feb 2007 18:56:19 -0600, NAME% wrote
(in article <1172019379.125896.186810@a75g2000cwd.googlegroups .com>):

What in the world is a disnomer? Is this another one of those "we made
up a word because the mood struck us" deals?


--
Randy Howard (2reply remove FOOBAR)
"The power of accurate observation is called cynicism by those
who have not got it." - George Bernard Shaw







Similar Posts