I developed a whole-web search engine from scratch!

I have long been interested in search engine technology, so I set out to build one. Working from the limited information available online and my own research, I eventually built a small search engine that covers the whole web. The project address and screenshots of test searches are at the bottom.

The project is written in PHP, but the language is not essential; what matters are the ideas, architecture, and algorithms.

The general process of the search engine

  1. Web collection

Web pages are collected by a crawler. Network conditions on the Internet are varied and unstable, so the crawler has to be robust enough to cope with the failures this causes. Crawling strategies are generally either depth-first or breadth-first, and the right choice depends on the situation. A single HTTP request is slow, taking anywhere from about one second to several seconds, so requests need to run concurrently (I use curl_multi). If resources allow, crawling can also be spread across a cluster of machines.
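As a rough illustration of the breadth-first strategy, here is a minimal sketch in Python (used for brevity; the project itself is PHP). The `fetch_links` callback is a hypothetical stand-in for "download the page and extract its links", and `TOY_WEB` is an invented link graph, so nothing here touches the network.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl_breadth_first(seed: str,
                        fetch_links: Callable[[str], Iterable[str]],
                        max_pages: int = 100) -> List[str]:
    """Visit pages level by level from the seed, never visiting a URL twice."""
    seen = {seed}
    frontier = deque([seed])
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()  # FIFO queue gives breadth-first order
        visited.append(url)
        try:
            links = fetch_links(url)
        except Exception:
            continue  # skip pages that fail instead of stopping the crawl
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Invented link graph used only for this example.
TOY_WEB = {"a": ["b", "c"], "b": ["d"], "c": ["d", "a"], "d": []}


def toy_fetch_links(url: str) -> List[str]:
    """Hypothetical fetcher: looks links up in TOY_WEB instead of doing HTTP."""
    return TOY_WEB[url]
```

Replacing `popleft()` with `pop()` would turn the queue into a stack and give a depth-first crawl instead. A real crawler would also issue many requests at once, which this sketch leaves out.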

  2. Preprocessing

Preprocessing is the most complicated part of a search engine, and most ranking algorithms take effect at this stage. During preprocessing, the search engine mainly performs the following steps on the data:

Extract keywords

The page fetched by the spider is the same source code we see in the browser. That code is usually messy, and much of it is irrelevant to the page’s main content. The search engine therefore needs to do the following:

① Code denoising. Remove all the code in the web page, leaving only the text.

② Remove non-content keywords, for example keywords in the navigation bar and in other common areas shared across pages.

③ Remove stop words. Stop words are words that carry no specific meaning, such as “的” and “在”.

Once the search engine has the page’s keywords, it runs them through its word segmentation system to produce a list of terms, stores the list in the database, and associates it one-to-one with the page’s URL.
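The code-denoising and stop-word steps can be sketched roughly as follows, in Python rather than the project’s PHP. The stop-word list is a tiny illustrative English one, and the tokenizer is a plain regular expression; Chinese text would need a proper word segmentation library, which is not shown here.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative list; a real engine would use a much larger one.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and", "to"}


class TextExtractor(HTMLParser):
    """Collects visible text and ignores the contents of script/style tags."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def extract_keywords(page_html):
    """Strip markup, lowercase, tokenize, and drop stop words."""
    parser = TextExtractor()
    parser.feed(page_html)
    parser.close()
    text = " ".join(parser.parts).lower()
    words = re.findall(r"[a-z0-9]+", text)
    return [w for w in words if w not in STOP_WORDS]
```

Removing navigation-bar keywords (step ②) needs knowledge of the page layout and is left out of this sketch.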

Web page deduplication

There is a large amount of duplicate content on the web. If duplicates are not handled, they end up in the database and seriously degrade the search experience. This step involves deduplication of massive data sets. A plain string comparison cannot tell whether two pages are duplicates, so the general approach is to extract a fingerprint for each page (which involves natural language processing, word vectors, and so on) and then compare the fingerprints and remove the duplicates.
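One well-known way to fingerprint a page is SimHash, sketched below in Python. This is offered only as an example of the fingerprint-and-compare idea, not as the method this project uses; the 64-bit size, the use of MD5 as the per-word hash, and the distance threshold of 3 are all assumptions made for illustration.

```python
import hashlib

FINGERPRINT_BITS = 64  # assumed size for this sketch


def simhash(words):
    """Combine per-word hashes into one 64-bit fingerprint."""
    counts = [0] * FINGERPRINT_BITS
    for word in words:
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for i in range(FINGERPRINT_BITS):
            # Each word votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(words_a, words_b, max_distance=3):
    """Treat two pages as duplicates if their fingerprints are close."""
    return hamming_distance(simhash(words_a), simhash(words_b)) <= max_distance
```

Because every word only votes on bit positions, pages that share most of their words tend to get fingerprints that differ in few bits, which is what makes the comparison cheaper than comparing full texts.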

Web denoising

Web denoising removes useless content such as tags from the page, and makes full use of the page’s markup (such as H tags and strong tags), keyword density, internal link anchor text, and so on to identify the page’s important phrases.
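To make the idea of using markup as a signal concrete, here is a small Python sketch that gives each term a score depending on the tags it appears inside. The weight values in `TAG_WEIGHTS` are invented for illustration; the original project may weight tags quite differently.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumed weights: text inside these tags counts for more than body text.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0, "strong": 2.0, "a": 1.5}
DEFAULT_WEIGHT = 1.0


class WeightedTermCollector(HTMLParser):
    """Accumulates a score per term based on the enclosing tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.scores = Counter()

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Use the strongest weight among all currently open tags.
        weight = max(
            [TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.open_tags],
            default=DEFAULT_WEIGHT,
        )
        for term in re.findall(r"\w+", data.lower()):
            self.scores[term] += weight


def score_terms(page_html):
    """Return a dict mapping each term to its markup-weighted score."""
    collector = WeightedTermCollector()
    collector.feed(page_html)
    collector.close()
    return dict(collector.scores)
```

Terms with the highest scores are candidates for the page’s important phrases. Keyword density is partly captured because repeated terms accumulate score; script and style contents are not filtered here and would need to be removed first.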

Data save and update

Once the data volume grows, every minor problem becomes a big one. After processing, the data is stored in a database, so the choice and design of the database are essential: it has to support fast insertion and fast queries over massive data. Stored data also has to be kept up to date, which calls for an update strategy. Crawling and updating this much content places high demands on both the performance and the number of servers.

Web page importance analysis

Determine the weight of the web page and combine it with the important-phrase analysis from the web denoising step to establish a ranking coefficient for each keyword in the page’s keyword set p.

Inverted index

Search engines can find matching content quickly because they use an index, which is a data structure. Search engines generally use an inverted index: the content of each web page is segmented into terms first, and then the ids of all documents that contain the same term are grouped under that term. Readers can look up the finer details on their own. A search engine needs a very high recall rate while still returning accurate results, so the word segmentation method and strategy need to be chosen carefully.

The index is divided into a full index and an incremental index. A full index rebuilds everything at once, which is relatively time-consuming. An incremental index covers only the newly added content each time, and queries merge its results with those of the old index.
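A minimal in-memory sketch of an inverted index, in Python, may help here. It assumes the documents have already been segmented into terms; the function names and the toy documents used with it are illustrative and not taken from the project.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the sorted ids of the documents that contain it.

    `documents` is a dict of doc_id -> list of already segmented terms.
    """
    postings = defaultdict(set)
    for doc_id, terms in documents.items():
        for term in terms:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_all_terms(index, query_terms):
    """Return ids of documents that contain every query term (AND query)."""
    if not query_terms:
        return []
    result = set(index.get(query_terms[0], []))
    for term in query_terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)
```

For example, with documents `{1: ["search", "engine"], 2: ["search", "crawler"], 3: ["crawler"]}`, the posting list for "search" is `[1, 2]`, and a query for both "search" and "crawler" returns `[2]`. Lookup cost depends on the length of the posting lists, not on the total number of documents, which is where the speed comes from.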
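The idea of querying a large, rarely rebuilt index together with a small index of newly added content can be sketched like this in Python. Both indexes are plain dicts from term to document ids, and the names `main_index` and `delta_index` are hypothetical.

```python
def lookup(main_index, delta_index, term):
    """Answer a query from both indexes and merge the posting lists."""
    merged = set(main_index.get(term, ())) | set(delta_index.get(term, ()))
    return sorted(merged)


def merge_indexes(main_index, delta_index):
    """Fold the small index into a new large index (done occasionally)."""
    merged = {term: set(ids) for term, ids in main_index.items()}
    for term, ids in delta_index.items():
        merged.setdefault(term, set()).update(ids)
    return {term: sorted(ids) for term, ids in merged.items()}
```

New pages only touch the small index, so they become searchable quickly, while the expensive rebuild of the large index can run on a slower schedule. Handling pages that were changed or deleted needs extra bookkeeping that this sketch does not cover.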

  3. Query service

As the name suggests, the query service handles the query requests that users submit on the search interface. The search engine first builds the retriever and then processes each request in the following steps.

Query rewrite

Many search queries are unclear or incomplete. Running word segmentation and search on the original text then gives unsatisfactory results, so the query has to be rewritten to express the searcher’s intent more accurately and achieve a higher recall rate.

Segment the query by query method and keywords

First, the user’s query is split into a sequence of keywords. Using q to denote the query, it is split into q={q1, q2, q3,…, qn}.

Then, based on how the user entered the query (for example, whether the words are run together or separated by spaces) and on the part of speech of each keyword in q, determine how much importance each word carries in the displayed query results.

Content filtering

A lot of web content includes illegal material, which has to be removed so that it is not displayed on the front end. Searchers also sometimes search for sensitive content, so content filtering has to process the search query as well.

Sort search results

Given the search term set q, calculate the importance of each keyword in q relative to each document that contains it, combine these values with a sorting algorithm, and the search results come out. The sorting algorithm is the core of the search engine and determines the accuracy of the search results. In practice, the ranking calculation is multi-dimensional and highly complicated.
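As a simple baseline for "importance of a keyword relative to a document", the Python sketch below uses TF-IDF. This is a standard textbook scoring scheme chosen for illustration, not the project’s actual ranking algorithm, and real ranking combines many more signals.

```python
import math
from collections import Counter


def rank_documents(query_terms, documents):
    """Return doc ids ordered by TF-IDF score for the query, best first.

    `documents` is a dict of doc_id -> list of already segmented terms.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for terms in documents.values():
        doc_freq.update(set(terms))  # count each term once per document

    scores = {}
    for doc_id, terms in documents.items():
        if not terms:
            continue
        term_counts = Counter(terms)
        score = 0.0
        for q in query_terms:
            if term_counts[q]:
                tf = term_counts[q] / len(terms)          # term frequency
                idf = math.log(1 + n_docs / doc_freq[q])  # rarity of the term
                score += tf * idf
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

A document scores higher when a query term makes up a larger share of its text, and rarer terms contribute more than common ones. Documents that contain none of the query terms are left out of the result.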

Display search results and document summaries

Once there are search results, the search engine displays them on the user interface. Search terms are generally highlighted in red to make them easier to spot.
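Marking the search terms in a result snippet can be sketched as follows in Python. The inline `style="color:red"` span is an assumption about the markup; a real front end would more likely use a CSS class.

```python
import html
import re


def highlight(snippet, query_terms):
    """Wrap every occurrence of a query term in a red span, HTML-escaped."""
    terms = [re.escape(t) for t in query_terms if t]
    if not terms:
        return html.escape(snippet)
    # Try longer terms first so "search engine" wins over "search".
    terms.sort(key=len, reverse=True)
    pattern = re.compile("(" + "|".join(terms) + ")", re.IGNORECASE)
    # With one capture group, re.split alternates: text, match, text, ...
    pieces = pattern.split(snippet)
    out = []
    for i, piece in enumerate(pieces):
        text = html.escape(piece)
        if i % 2:  # odd positions are the matched terms
            out.append('<span style="color:red">' + text + "</span>")
        else:
            out.append(text)
    return "".join(out)
```

Matching is done on the raw snippet and escaping is applied afterwards, piece by piece, so a query term can never match inside an HTML entity such as `&lt;`.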

