I Want To Create A Search Engine For Searching For Text Within Word?

I want to create a search engine for searching for text within word and PDF files. How can I proceed?

The core of a search engine is a reverse index (more commonly called an inverted index). So, let's imagine the web as a hypothetical database. You can think of the web as a 2-column table: URL and page contents. URL holds the URL of a page, and page contents holds its contents. URL is the primary key. You type the URL into your browser, the browser looks it up by primary key in the database, gets the row, and shows you the page content. Good, right?

Well, not good for search! Why? Because if you are searching for pizza, you have to go through all the records and scan the page contents column of every row for the word pizza. Bad, no? That is O(n), and when you are talking about the web, that n is going to get big.

So, what do you do? You flip this table around. Make the page content the key and the URL the value. Not only that, you split the page content into individual terms and create a record for each term. So, let's say you have 3 very small web pages that say:

http://yummypizza.com - Italian Pizza
http://yummierpizza.com - Sicilian Pizza
http://sexyshoes.com - Italian Shoes

Our table now looks like this:

Italian - http://yummypizza.com, http://sexyshoes.com
Pizza - http://yummypizza.com, http://yummierpizza.com
Sicilian - http://yummierpizza.com
Shoes - http://sexyshoes.com

Now, when someone searches for Pizza, you simply look up the record for Pizza and you get both the web pages. Fastest search engine in the world, right? All it needs to do is look up one record. There, you beat Google!

But wait! What if someone searches for Italian Pizza? Ooooh. What you can do is find the URLs that are common to the first entry (Italian) and the second entry (Pizza). Mathematically speaking, you are performing an intersection of 2 sets. So, you need some sort of algorithm that can do a fast intersection across large data sets.

Now you've got it. Buy a few thousand servers, beat Google at its own game. Yeah!

But wait! What if someone searches Italy Pizza? You want Italian pizza to show up, right?

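Before getting to that, here is a minimal Python sketch of the flipped table and the intersection step so far. It is only an illustration under simple assumptions: the three toy pages come from the example above, terms are split on whitespace and lowercased, and the names PAGES, build_index and search are made up for this sketch.

```python
from collections import defaultdict

# Toy corpus from the example above: URL -> page contents.
PAGES = {
    "http://yummypizza.com": "Italian Pizza",
    "http://yummierpizza.com": "Sicilian Pizza",
    "http://sexyshoes.com": "Italian Shoes",
}


def build_index(pages):
    """Flip the table: map each lowercased term to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every term of the query (set intersection)."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    # Start from the smallest set so each intersection touches as few URLs as possible.
    postings.sort(key=len)
    result = set(postings[0])
    for urls in postings[1:]:
        result &= urls
    return result


INDEX = build_index(PAGES)
print(search(INDEX, "Pizza"))
print(search(INDEX, "Italian Pizza"))
print(search(INDEX, "Italy Pizza"))  # empty: nothing is indexed under "italy" yet
```

A single-term search is one dictionary lookup, and a two-term search is one intersection. The last query comes back empty, which is exactly the Italy Pizza problem.
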
Oooh. So, you need to expand your terms using synonyms. Essentially, you need a dictionary of synonyms (words grouped together by meaning). Then, when you insert a URL under a term, you also insert it under that term's synonyms. So, your table looks like this:

Italy - http://yummypizza.com, http://sexyshoes.com
Italian - http://yummypizza.com, http://sexyshoes.com
Pizza - http://yummypizza.com, http://yummierpizza.com
Sicilian - http://yummierpizza.com
Shoes - http://sexyshoes.com

Now you're ready to beat Google! Are you looking for investors? Because I'm there!

But wait, what if someone searches "I want Italian Pizza" and your web page for Italian pizza doesn't have the words "I want"? Well, it doesn't make sense to search for "I" and "want", right? So, you need to chuck out some words from the input. These are called stopwords. So, "I want Italian Pizza" becomes "Italian Pizza". All you need is a dictionary of stopwords. You can even use this to stop people from searching for bad words.

Ready to beat Google? Angel investors all lined up? No, wait. People make spelling mistakes. There are algorithms that convert words into codes based on how they sound; Soundex is one of them. You can convert all your terms into Soundex codes, and convert the search term into Soundex too. If you don't get enough results when you search without Soundex, you fall back to Soundex.

Great! Now I'm ready to rule the world! But wait, what about ranking? When you are talking about showing millions of results, what you show as the first result matters. So, how do you sort them? Alphabetically? Lame! Sorted by time? Lame! Or how about ranking a page higher if the word Pizza comes up more times in it? (And then SEO people will fill the page with Pizza Pizza Pizza Pizza... Anyone remember the days when SEO experts would ask you to put a bunch of shit in your page?) And you can't just let the user pick his/her own sorting, because re-sorting millions of results will kill your server.

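The synonym, stopword and Soundex steps above can be sketched together in Python. Treat this as a rough illustration, not a reference implementation: the STOPWORDS and SYNONYMS dictionaries are tiny made-up samples, the function names are hypothetical, and "not enough results" is simplified here to "no results at all".

```python
import re
from collections import defaultdict

PAGES = {
    "http://yummypizza.com": "Italian Pizza",
    "http://yummierpizza.com": "Sicilian Pizza",
    "http://sexyshoes.com": "Italian Shoes",
}
STOPWORDS = {"i", "want", "a", "the"}                     # made-up sample list
SYNONYMS = {"italian": {"italy"}, "italy": {"italian"}}   # made-up sample dictionary

_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """American Soundex: first letter plus three digits, e.g. 'Robert' -> 'R163'."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    digits = []
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters that share a code
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it counts again
    return (word[0].upper() + "".join(digits) + "000")[:4]


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def build_indexes(pages):
    """Index every term and its synonyms, both by spelling and by Soundex code."""
    by_term, by_sound = defaultdict(set), defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            for variant in {term} | SYNONYMS.get(term, set()):
                by_term[variant].add(url)
                by_sound[soundex(variant)].add(url)
    return by_term, by_sound


def lookup(index, keys):
    sets = [index.get(key, set()) for key in keys]
    return set.intersection(*sets) if sets else set()


def search(query, by_term, by_sound):
    words = [w for w in tokenize(query) if w not in STOPWORDS]
    exact = lookup(by_term, words)
    if exact:
        return exact
    # Fall back to sound-alike matching only when exact spelling finds nothing.
    return lookup(by_sound, [soundex(w) for w in words])


BY_TERM, BY_SOUND = build_indexes(PAGES)
print(search("I want Italy Pizza", BY_TERM, BY_SOUND))
print(search("Italian Piza", BY_TERM, BY_SOUND))  # misspelling, found via Soundex
```

Synonyms are expanded at index time, as in the table above, so a query costs no extra lookups. The price is a bigger index.
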
So, you come up with your own way of sorting the results, and you physically store the records in that order. This means that you can read just the first part of the URL list for a term, and you don't have to sort on every search. This is where you can't beat Google, because of Google's ranking, which grew out of the PageRank algorithm. The basic PageRank idea has been published, but the full ranking system is proprietary, and that's Google's secret sauce.

Of course, you might not want to compete with Google. You might be building a search engine to search through your library of Kindle books, right? However, the core problem of algorithmically ranking search results is very hard to nail down. You might find that how you rank depends on what you are searching. This is where most search engines have trouble. And that's why Google basically beat every other web search engine.

BTW, you don't have to implement all this yourself. Solr, which is built on Lucene, provides all of this. You need to provide the data and decide how to rank your results.
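
If you do want to play with the "store the records in rank order" idea yourself, a minimal Python sketch could look like this. The SCORES values are invented for illustration (a real engine would compute them somehow), and build_ranked_index and top_k are hypothetical names, not part of Solr or Lucene.

```python
from collections import defaultdict

PAGES = {
    "http://yummypizza.com": "Italian Pizza",
    "http://yummierpizza.com": "Sicilian Pizza",
    "http://sexyshoes.com": "Italian Shoes",
}
# Invented query-independent scores; higher means "show it first".
SCORES = {
    "http://yummypizza.com": 0.9,
    "http://yummierpizza.com": 0.7,
    "http://sexyshoes.com": 0.4,
}


def build_ranked_index(pages, scores):
    """Map each term to a list of URLs that is already sorted best-first."""
    index = defaultdict(list)
    for url, text in pages.items():
        for term in set(text.lower().split()):
            index[term].append(url)
    # The sort happens once, at index time, not on every search.
    return {term: sorted(urls, key=lambda u: -scores[u])
            for term, urls in index.items()}


def top_k(index, term, k):
    """Read only the first k entries of the stored list; no sorting at query time."""
    return index.get(term.lower(), [])[:k]


RANKED = build_ranked_index(PAGES, SCORES)
print(top_k(RANKED, "Pizza", 1))
print(top_k(RANKED, "Italian", 2))
```

The trade-off is that the order is fixed when you build the index, which is why the user doesn't get to pick a different sort per search.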
