infoRetrieval

Index

Biword Index

Phrase Index

Positional Index

Inverted Index

Positional Index vs Inverted Index

Inverse Document Frequency

Detail

Features of an inverted index.

Key steps in building an index (a minimal sketch follows the list):

  1. Traversing a directory of documents
  2. Reading each document, then extracting and tokenizing all of its text
  3. Computing counts of documents and terms
  4. Building a dictionary of the unique terms that occur in the corpus
  5. Writing the sorted term dictionary out to a disk file
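
A minimal sketch of those five steps in Python; the tokenizer, file layout, and tab-separated output format are illustrative assumptions rather than this repository's actual implementation:

```python
import os
import re
from collections import defaultdict

def build_index(doc_dir, out_path="index.txt"):
    """Build an inverted index over every file in doc_dir (a minimal sketch)."""
    index = defaultdict(set)                 # term -> set of docIDs
    num_docs = num_tokens = 0

    # 1. Traverse the directory of documents
    for name in sorted(os.listdir(doc_dir)):
        path = os.path.join(doc_dir, name)
        if not os.path.isfile(path):
            continue
        doc_id = num_docs
        num_docs += 1
        # 2. Read the document, extract and tokenize its text
        with open(path, encoding="utf-8", errors="ignore") as f:
            tokens = re.findall(r"[a-z0-9]+", f.read().lower())
        # 3. Count documents and terms
        num_tokens += len(tokens)
        # 4. Add each term to the dictionary with its posting
        for term in tokens:
            index[term].add(doc_id)

    # 5. Write the sorted term dictionary out to a disk file
    with open(out_path, "w", encoding="utf-8") as out:
        for term in sorted(index):
            out.write(term + "\t" + ",".join(map(str, sorted(index[term]))) + "\n")

    # Report: documents, terms, unique terms
    return num_docs, num_tokens, len(index)
```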

Inverted Index Construction

Model

Report

documents, terms, unique terms


  1. Number of documents processed

  2. Total number of terms parsed from all documents

  3. Total number of unique terms found and added to the index

documents: here

index: here

MapReduce

Implementation of Hadoop search technology: a simple DB keeps track of the dictionary.

Problem:

Algorithm

DFS: distributed file system

Automatic parallel execution of MapReduce jobs in Hadoop (a minimal sketch of the map and reduce functions follows).
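
Hadoop handles splitting the input, shuffling, and parallel execution across the DFS; the user supplies only a map and a reduce function. The sketch below imitates that data flow for inverted index construction in plain Python (the sample documents and function names are made up; no Hadoop API is used):

```python
from itertools import groupby
from operator import itemgetter

# Map phase: emit one (term, docID) pair per token in the document.
def map_fn(doc_id, text):
    for term in text.lower().split():
        yield (term, doc_id)

# Reduce phase: merge all docIDs for one term into a sorted postings list.
def reduce_fn(term, doc_ids):
    return (term, sorted(set(doc_ids)))

docs = {1: "new home sales top forecasts",
        2: "home sales rise in july"}

# The framework's shuffle/sort phase, imitated with an explicit sort + group.
pairs = sorted(pair for doc_id, text in docs.items()
               for pair in map_fn(doc_id, text))
index = [reduce_fn(term, [d for _, d in group])
         for term, group in groupby(pairs, key=itemgetter(0))]
print(index)   # [('forecasts', [1]), ('home', [1, 2]), ('in', [2]), ...]
```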

Dictionary Data Structures


Example: an array of structs as a naïve dictionary (sketched below)
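
A sketch of that naïve structure, with binary search over the sorted array (field names and sample data are invented for illustration):

```python
from bisect import bisect_left
from dataclasses import dataclass

@dataclass
class Entry:
    term: str                         # the dictionary key
    doc_freq: int                     # number of documents containing the term
    postings: list                    # sorted docIDs

# Array of structs, kept sorted by term so it can be binary-searched.
dictionary = [
    Entry("cat", 2, [1, 4]),
    Entry("dog", 1, [2]),
    Entry("fish", 3, [1, 2, 3]),
]
terms = [e.term for e in dictionary]  # parallel key array for bisect

def lookup(term):
    """O(log n) binary search over the sorted array."""
    i = bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return dictionary[i]
    return None

print(lookup("dog"))                  # Entry(term='dog', doc_freq=1, postings=[2])
```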

Options for the dictionary structure:


Binary tree: each comparison splits the remaining search range in half.

Every node has at most two children.

e.g. B-Tree


Every node has a variable number of children.

Any particular node may therefore branch into two or more subtrees, giving multiple options at each level.

Tolerant Retrieval

Wild-card queries

e.g. *mon: finds words ending in mon

Query processing

e.g. to index hello, add every rotation of hello$ (hello$, ello$h, llo$he, lo$hel, o$hell, $hello) to the dictionary; a wildcard query is rotated until the * is at the end and then matched by prefix (see the sketch below).

Con: increases the number of terms in the dictionary

Permuterm query processing

Con: uses a lot more space for the index
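
A sketch of the permuterm scheme, assuming a single * per query; a real implementation would store the rotations in a B-tree for fast prefix lookup instead of scanning a dict:

```python
def rotations(term):
    """Every rotation of term + '$'; all of them go into the dictionary."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

# Permuterm index: rotation -> original vocabulary term
permuterm = {}
for term in ["hello", "help", "lemon"]:
    for rot in rotations(term):
        permuterm[rot] = term

def wildcard(query):
    """Rotate the query so the '*' lands at the end, then prefix-match."""
    q = query + "$"
    star = q.index("*")
    prefix = q[star + 1:] + q[:star]   # e.g. '*mon$' -> 'mon$'
    return {t for rot, t in permuterm.items() if rot.startswith(prefix)}

print(wildcard("*mon"))   # {'lemon'}  -- terms ending in 'mon'
print(wildcard("hel*"))   # {'hello', 'help'}
```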

Bigram (k-gram) indexes

Finds terms by breaking the query into k-grams, intersecting the postings of those k-grams, and post-filtering the candidates against the original wildcard (sketched below).
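
A sketch of a bigram (k = 2) index with that post-filtering step; the vocabulary and helper names are invented:

```python
import fnmatch
from collections import defaultdict

def kgrams(s, k=2):
    """All k-length windows of s, e.g. 'mon$' -> {'mo', 'on', 'n$'}."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

vocabulary = ["lemon", "month", "moon", "hello"]

# Bigram index: k-gram -> set of vocabulary terms containing it
index = defaultdict(set)
for term in vocabulary:
    for g in kgrams(f"${term}$"):
        index[g].add(term)

def wildcard(query, k=2):
    # Split the padded query on '*' and collect k-grams from each fragment.
    grams = set()
    for piece in f"${query}$".split("*"):
        grams |= kgrams(piece, k)
    # Intersect postings, then post-filter to drop false matches
    # (e.g. 'moon' contains 'mo' and 'on' but does not match '*mon').
    candidates = set(vocabulary)
    for g in grams:
        candidates &= index[g]
    return {t for t in candidates if fnmatch.fnmatch(t, query)}

print(wildcard("*mon"))   # {'lemon'}
print(wildcard("mon*"))   # {'month'}
```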

Index Compression

Key Terms

1. Dictionary compression

Aims to fit the dictionary, or at least a large portion of it, in main memory.

The dictionary can be stored as a single string holding the lexicographically sorted vocabulary, replacing an array of fixed-width entries with term pointers into the string; blocked storage further groups the terms in the string into blocks of size k and keeps a term pointer only for the first term of each block.

2. Rule of 30

The 30 most common words account for 30% of the tokens in the written text.

Thus a lossy method (e.g., dropping the most common words from the index) can save substantial space without losing much retrieval effectiveness.


3. Lossy Compression

Some of the data is lost during compression (e.g., case folding, stemming, and stop-word removal are lossy when viewed this way).


4. Lossless Compression

No data is lost during compression


5. Heaps’ law

Estimates the number of unique terms M in a collection from the number of terms or tokens (T) parsed from all documents:

  M = kT<sup>β</sup>

in which T is the number of tokens in the collection and k and β are parameters, typically 30 ≤ k ≤ 100 and β ≈ 0.5 (Manning, 2009); for example, k = 44 and β = 0.49 predict 44 × 1,000,020<sup>0.49</sup> ≈ 38,323 distinct terms in the first 1,000,020 tokens of Reuters-RCV1.

![alt text](https://github.com/nglthu/infoRetrieval/blob/master/Heap_law/sizeOfM.png)

6. Zipf’s law

cf<sub>i</sub> = c · i<sup>k</sup> with k ≈ −1: the collection frequency cf<sub>i</sub> of the i-th most common term is proportional to 1/i

an instance of a power law

7. Power law

A relationship of the form y = cx<sup>k</sup>; Zipf’s law is the special case with k ≈ −1.

8. Front Coding

Consecutive terms in a sorted dictionary share long common prefixes; front coding stores each term as the length of the prefix it shares with the previous term plus its differing suffix.

9. Variable Byte Encoding

Encodes each number in an integral number of bytes, 7 payload bits per byte plus a continuation bit (see the sketch after this list).

10. Nibble

Half a byte (4 bits); can serve as a smaller encoding unit than the byte in variable-length codes.

11. Unary Code

A string of n 1s followed by a 0

12. Encoding

Two types of methods: bytewise (e.g., variable byte codes) and bitwise (e.g., γ and δ codes).

Variable byte encoding uses an integral number of bytes to encode each gap between successive docIDs, rather than the docIDs themselves (see the sketch after this list).


13. Entropy

A measure of the information content of a probability distribution P; it lower-bounds the expected number of bits per encoded symbol.

14. δ Codes

Bitwise codes that are asymptotically optimal as the entropy H(P) → ∞.
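
A sketch of variable byte encoding and decoding of gaps, following the byte layout described in Manning (2009, ch. 5): 7 payload bits per byte, with the high bit set on the final byte of each number. The example postings list (824, 829, 215406) is the book's:

```python
def vb_encode(n):
    """Variable byte code: 7 payload bits per byte; the high bit is set
    only on the final byte of a number (continuation-bit convention)."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128                     # mark the last byte of this number
    return bytes(out)

def vb_decode(stream):
    """Decode a concatenated byte stream back into a list of numbers."""
    numbers, n = [], 0
    for byte in stream:
        if byte < 128:                 # continuation: accumulate 7 more bits
            n = n * 128 + byte
        else:                          # final byte: finish this number
            numbers.append(n * 128 + (byte - 128))
            n = 0
    return numbers

# Postings are stored as gaps between successive docIDs, which stay small.
postings = [824, 829, 215406]
gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
encoded = b"".join(vb_encode(g) for g in gaps)
print(gaps)                 # [824, 5, 214577]
print(list(encoded))        # [6, 184, 133, 13, 12, 177]
print(vb_decode(encoded))   # [824, 5, 214577]
```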

Web crawler

What is involved in creating a web crawler?

Purpose: to get the information that is available on a website.

Process: as described by Manning (2009, chapter 20).


Figure: The basic crawler architecture, reproduced from Figure 20.1 (Manning, 2009, chapter 20)
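
A toy version of that loop (fetch from the URL frontier, parse out links, eliminate duplicate URLs, append new ones to the frontier); a production crawler would add robots.txt checks, politeness delays, content fingerprinting, and a distributed frontier, all omitted here:

```python
import re
import urllib.parse
import urllib.request
from collections import deque

def crawl(seed, max_pages=10):
    """Toy crawler loop over an in-memory URL frontier."""
    frontier = deque([seed])
    seen = {seed}                              # duplicate-URL eliminator
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="ignore")
        except OSError:
            continue                           # skip unreachable pages
        fetched += 1
        # ... hand `html` to the indexer here ...
        for href in re.findall(r'href="([^"]+)"', html):
            link = urllib.parse.urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return seen
```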

Static vs dynamic web content

Static: the same prebuilt content is served each time the page is loaded.

Dynamic: content can change and be generated on the fly.

Query

Query Type

Boolean Retrieval vs Wildcard Queries vs Phrase Queries

Improving Score Computation and Ranking

References:

Manning, C. D., Raghavan, P., & Schütze, H. (2009). Introduction to Information Retrieval (Online ed.). Cambridge, England: Cambridge University Press. Available at http://nlp.stanford.edu/IR-book/information-retrieval-book.html