A (fledgling) Plan for p2p Search
I know p2p search is hopeless [searchenginewatch.com], but here are some ideas on how to do it anyway. I'll phrase it like an inductive proof: first make a node, then add a neighbor.
NODE - I'd use Lucene [apache.org]. Lucene is a traditional keyword search engine that is fast, lean, free and open. It's carried under the Apache Jakarta project, so it's not going anywhere. And it's easy to develop with. Alternatively, any good search will do... you could probably bang something together with GNU shell utils. You would also need a spider of some sort to grab internet content and store it locally. Again, a GNU tool like wget would do. It could spider N links deep from the pages you browse, so it would likely have ready responses for many of your queries.
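To make "any good search will do" concrete, here is a rough sketch of the smallest keyword index I can think of. It is a stand-in for Lucene, not Lucene's API; the class name TinyIndex and its methods are made up for illustration, and a real node would also want stemming, ranking and on-disk storage.

```python
from collections import defaultdict


class TinyIndex:
    """Toy in-memory keyword index (hypothetical stand-in, not Lucene)."""

    def __init__(self):
        # keyword -> set of URLs whose text contains that keyword
        self.postings = defaultdict(set)

    def add(self, url, text):
        """Index a fetched page under every whitespace-separated word."""
        for word in text.lower().split():
            self.postings[word].add(url)

    def search(self, query):
        """Return URLs containing every query word, sorted for stable output."""
        words = query.lower().split()
        if not words:
            return []
        hits = set(self.postings.get(words[0], set()))
        for word in words[1:]:
            hits &= self.postings.get(word, set())
        return sorted(hits)
```

The spider would call add() for each page it stores, and the node would answer queries with search().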
NEIGHBOR - Turn search into a common TCP/IP protocol, a la SMTP, FTP, etc. Telnet to port 53268 (the digits that most look like "SEARCH", leaving one out, to make it fit in the legal range of 1-65535), and have an exchange something like this:
client: QRY p2p search efforts
server: HITS 1023
client: RETR 0
server: HIT http://searchenginewatch.com/sereport/article.php/2163581
...
If there are no results at that node, the server forwards you on:
client: QRY p2p search efforts
server: FWD 255.168.1.303
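The two exchanges above can be sketched as a small per-connection handler. This is only one possible reading of the protocol: the QRY, HITS, RETR, HIT and FWD verbs come from the examples, while the ERR replies, the "HITS 0" fallback when there is nowhere to forward, and the names SearchSession, search and next_hop are my own assumptions.

```python
class SearchSession:
    """State for one client connection (hypothetical sketch of the protocol)."""

    def __init__(self, search, next_hop=None):
        self.search = search      # callable: query string -> list of URLs
        self.next_hop = next_hop  # address to forward to when we have no hits
        self.results = []         # hits from the most recent QRY

    def handle(self, line):
        """Take one command line from the client, return one reply line."""
        verb, _, arg = line.strip().partition(" ")
        if verb == "QRY":
            self.results = self.search(arg)
            if self.results:
                return f"HITS {len(self.results)}"
            if self.next_hop:
                return f"FWD {self.next_hop}"
            return "HITS 0"  # assumption: no hits and nobody to forward to
        if verb == "RETR":
            if arg.isdecimal() and int(arg) < len(self.results):
                return f"HIT {self.results[int(arg)]}"
            return "ERR no such hit"  # assumption: ERR is not in the examples
        return "ERR unknown command"
```

Wrapping handle() in a line-oriented TCP server on port 53268 is left out here; the point is just that each command maps to exactly one reply line, which is what keeps it telnet-friendly.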
So, you'd start by querying your own host's search engine. It may already have a number of hits, in which case you don't even use the network for searching at all! But your own node may not have the answer for you, so it forwards you on to the next. How does the forwarding table get set up? One way to do it would be by hand, but I also imagine posting "known expert" lists to Gnutella could help automate the process. A list would be a map of keywords to IPs. These lists wouldn't need to be too robust, as they'd serve to occasionally seed the network, not constantly sustain it.
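Here is a rough sketch of how a "known expert" list might drive forwarding, assuming the list is loaded as a dict from keyword to a list of IPs. The function names, the most-keywords-wins rule, the hop limit and the loop check are all my own guesses; ask() is a hypothetical transport that sends one line to a node and returns its one-line reply.

```python
def pick_next_hop(forwarding_table, query):
    """Return the peer listed for the most query keywords, or None."""
    votes = {}
    for word in query.lower().split():
        for ip in forwarding_table.get(word, []):
            votes[ip] = votes.get(ip, 0) + 1
    if not votes:
        return None
    # Most keyword matches wins; ties are broken by address so the
    # choice is deterministic.
    return min(votes, key=lambda ip: (-votes[ip], ip))


def resolve(query, ask, start, max_hops=5):
    """Follow FWD replies from node to node until one reports hits.

    Returns (node, hit_count), or None if the chain loops, runs past
    max_hops, or a node sends a reply we don't understand.
    """
    node = start
    seen = set()
    for _ in range(max_hops):
        if node in seen:
            return None  # forwarding loop
        seen.add(node)
        reply = ask(node, f"QRY {query}")
        verb, _, arg = reply.partition(" ")
        if verb == "HITS":
            return node, int(arg)
        if verb != "FWD":
            return None
        node = arg
    return None
```

A server would call pick_next_hop() when its own index comes up empty, and a client would call resolve() starting from its own host.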
Once you had a good forwarding table on your node, you'd have access to quite a large search DB. With 100 nodes in the search network, each using 1GB for its index, and a 3:10 index-to-indexed ratio, that's 100*1GB*3.3=330GB of indexed text. Let's say the average webpage is 100KB (?); that's a total search DB size of about 3.46M pages (counting 1GB as 1024*1024KB). Increase the number of nodes to 10,000 and increase each node's index size to 10GB, and you have 3,460,300,800 pages, which is just about equal to Google, which is currently at 3,307,998,701. 10k nodes happens to be about what distributed.net is running right now, and 10GB is getting cheaper by the minute. ;)
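For anyone who wants to play with the numbers, the estimate above boils down to a few lines. The function name and parameter names are mine, and the 3.3 multiplier and 100KB page size are the same guesses as in the text; binary units (1GB = 1024^3 bytes) are what reproduce the 3,460,300,800 figure.

```python
GIB = 1024 ** 3  # bytes per GB, binary units
KIB = 1024       # bytes per KB, binary units


def indexed_pages(nodes, index_gib_per_node, text_per_index=3.3, page_kib=100):
    """Estimate how many pages the whole network can index.

    text_per_index is the amount of indexed text per unit of index
    (the 3:10 index-to-indexed ratio, rounded to 3.3).
    """
    indexed_bytes = nodes * index_gib_per_node * GIB * text_per_index
    return indexed_bytes / (page_kib * KIB)


small = indexed_pages(100, 1)      # roughly 3.46 million pages
large = indexed_pages(10_000, 10)  # roughly 3.46 billion pages
```

Halve the assumed page size and the page count doubles, so the comparison with Google is only as good as that 100KB guess.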