Positional Index vs Inverted index
Number of documents processed
Total number of terms parsed from all documents
Total number of unique terms found and added to the index
Implementation of Hadoop search technology: a simple DB that keeps track of the dictionary
see 2
bob 1
spot 1
throw 1
Map (key=url, val=contents):
→ For each word w in contents, emit (w, "1")
Reduce (key=word, values=uniq_counts):
→ Sum all counts in values and emit (word, total)
Automatic parallel execution of MapReduce jobs in Hadoop
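The word-count job above can be sketched in plain Python — an in-memory stand-in for Hadoop's map, shuffle, and reduce phases (`map_fn`, `reduce_fn`, and `run_job` are illustrative names, not Hadoop APIs):

```python
# A minimal in-memory sketch of the word-count MapReduce job.
from collections import defaultdict

def map_fn(url, contents):
    # For each word w in contents, emit (w, 1)
    for w in contents.split():
        yield (w, 1)

def reduce_fn(word, counts):
    # Sum the per-word counts emitted by the mappers
    return (word, sum(counts))

def run_job(docs):
    groups = defaultdict(list)              # shuffle: group values by key
    for url, contents in docs.items():
        for word, count in map_fn(url, contents):
            groups[word].append(count)
    return dict(reduce_fn(w, c) for w, c in groups.items())

docs = {"doc1": "see spot throw", "doc2": "see bob"}
print(run_job(docs))  # {'see': 2, 'spot': 1, 'throw': 1, 'bob': 1}
```

In Hadoop proper, the shuffle and the parallel execution of mappers and reducers happen automatically; only the two functions are user code.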
Example: An array of struct as a naïve dictionary
Each vocabulary term is hashed to an integer
Pros: faster lookup than a tree
Cons: no easy way to find minor variants; no prefix search; rehashing is expensive
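A quick sketch of the hashed-dictionary trade-off using a plain Python dict (the vocabulary is chosen for illustration):

```python
# Hashed dictionary sketch: each vocabulary term hashes to an integer term ID.
vocab = ["automate", "automatic", "automation", "zebra"]
term_id = {t: i for i, t in enumerate(sorted(vocab))}

print(term_id["zebra"])  # O(1) average-case exact lookup

# No prefix search: finding all terms starting with "auto" needs a full scan.
print(sorted(t for t in term_id if t.startswith("auto")))
```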
e.g. binary tree: each search step splits the remaining range in half
Every node has exactly two children
e.g. B-Tree
Every node has a variable number of children (between a and b, for suitable a and b)
Any particular node may branch into two or more subtrees, giving multiple options at each level
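For contrast with hashing, a sorted-array binary search (via Python's stdlib `bisect`) shows both the halving behavior and the prefix lookups that tree-structured dictionaries support:

```python
# Binary search over a sorted vocabulary halves the remaining range at each
# comparison, and (unlike hashing) also supports prefix lookups.
import bisect

vocab = sorted(["bob", "see", "spot", "throw"])

i = bisect.bisect_left(vocab, "spot")      # exact lookup in O(log n)
print(vocab[i])                            # 'spot'

lo, hi = bisect.bisect_left(vocab, "s"), bisect.bisect_left(vocab, "t")
print(vocab[lo:hi])                        # ['see', 'spot'] -- terms with prefix 's'
```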
e.g. *mon: to find words ending with mon
B-trees handle * at the end of query
Permuterm index: handles * in the middle of a query
e.g. for hello, index all rotations of hello$: hello$, ello$h, llo$he, lo$hel, o$hell, $hello; a wildcard query is rotated until its * is at the end, then answered as a prefix search
Pros: handles * anywhere in the query term
Cons: increases the number of terms in the dictionary and uses much more space for the index
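A minimal permuterm sketch over a toy two-term vocabulary (`rotations` and `wildcard` are illustrative names; only single-* queries are handled):

```python
# Permuterm index sketch: index every rotation of term + '$'; a wildcard
# query is rotated until the '*' is at the end, then matched as a prefix.
def rotations(term):
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

index = {}                                  # rotation -> matching terms
for t in ["hello", "help"]:
    for r in rotations(t):
        index.setdefault(r, set()).add(t)

def wildcard(query):
    # e.g. 'h*o' -> rotate 'h*o$' so '*' is last: prefix becomes 'o$h'
    pre, suf = query.split("*")
    prefix = suf + "$" + pre
    return {t for r, ts in index.items() if r.startswith(prefix) for t in ts}

print(sorted(rotations("hello")))  # ['$hello', 'ello$h', 'hello$', 'llo$he', 'lo$hel', 'o$hell']
print(wildcard("h*o"))             # {'hello'}
```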
Finds terms based on the k-grams (character substrings of length k) of a query
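A k-gram index sketch with post-filtering; the toy vocabulary and the pattern `re*ed` are assumptions for illustration:

```python
# k-gram index sketch: map each k-gram (with '$' marking term boundaries) to
# the terms containing it; a wildcard query intersects its k-grams' term sets
# and post-filters the survivors against the original pattern.
import fnmatch
from collections import defaultdict

def kgrams(term, k=3):
    s = "$" + term + "$"
    return {s[i:i + k] for i in range(len(s) - k + 1)}

index = defaultdict(set)
for t in ["red", "retired", "retrieval"]:
    for g in kgrams(t):
        index[g].add(t)

# Query re*ed -> k-grams $re and ed$; 'red' survives the intersection
# but fails the post-filter against the full pattern.
candidates = index["$re"] & index["ed$"]
matches = {t for t in candidates if fnmatch.fnmatch(t, "re*ed")}
print(sorted(candidates), sorted(matches))  # ['red', 'retired'] ['retired']
```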
Dictionary-as-a-string: sort the vocabulary lexicographically and store it as one long string, either in an array of fixed-width entries or with blocked storage: group the terms in the string into blocks of size k and keep a term pointer only for the first term of each block
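A blocked-storage sketch with k = 4 and length-prefixed terms (a real implementation would use bytes rather than a Python str; the vocabulary is illustrative):

```python
# Blocked storage sketch (k = 4): the sorted vocabulary is one long string of
# length-prefixed terms; a pointer is kept only to the start of each block.
k = 4
vocab = sorted(["bob", "see", "spot", "throw", "zebra"])
blob, block_ptrs = "", []
for i, t in enumerate(vocab):
    if i % k == 0:
        block_ptrs.append(len(blob))
    blob += chr(len(t)) + t              # length prefix, then the term

def read_block(ptr):
    terms, pos = [], ptr
    for _ in range(k):
        if pos >= len(blob):
            break
        n = ord(blob[pos])
        terms.append(blob[pos + 1: pos + 1 + n])
        pos += 1 + n
    return terms

print(read_block(block_ptrs[0]))  # ['bob', 'see', 'spot', 'throw']
print(read_block(block_ptrs[1]))  # ['zebra']
```

Lookup finds the right block by binary search over the block pointers, then scans at most k terms within the block.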
2. Rule of 30
The 30 most common words account for 30% of the tokens in the written text.
Thus, lossy compression (e.g. discarding the most common words) can shrink the index without greatly reducing retrieval effectiveness.
3. Lossy Compression
Amount of data is lost during this process
4. Lossless Compression
No data is lost during compression
5. Heaps’ law
Estimates the number of unique terms M in a collection from the number of tokens T parsed from all documents:
M = kT<sup>ß</sup>
in which T is the number of tokens in the collection, and k and ß are parameters, typically 30 ≤ k ≤ 100 and ß ≈ 0.5
![alt text](https://github.com/nglthu/infoRetrieval/blob/master/Heap_law/sizeOfM.png)
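A numeric sketch of Heaps' law; k = 44 and ß = 0.49 are the values Manning et al. fit for the RCV1 collection, assumed here for illustration:

```python
# Heaps' law: M = k * T**beta estimates vocabulary size M from token count T.
k, beta = 44, 0.49          # parameters fit for RCV1 (Manning et al., ch. 5)
T = 1_000_000               # tokens parsed from all documents
M = k * T ** beta
print(round(M))             # roughly 38,000 unique terms
```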
6. Zipf’s law
cf<sub>i</sub> = ci<sup>k</sup> with k = −1; i.e. the collection frequency cf<sub>i</sub> of the i-th most common term is proportional to 1/i
Zipf’s law is an example of a power law.
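A numeric sketch of Zipf's law (c = 60,000 is an arbitrary illustrative constant):

```python
# Zipf's law: cf_i = c * i**k with k = -1, so the 2nd most common term is
# half as frequent as the 1st, the 3rd a third as frequent, and so on.
c, k = 60_000, -1
cf = [round(c * i ** k) for i in range(1, 6)]
print(cf)  # [60000, 30000, 20000, 15000, 12000]
```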
7. Power law
A relationship of the form y = cx<sup>k</sup>; Zipf’s law is the special case k = −1.
8. Front Coding
Consecutive terms in a sorted dictionary share long common prefixes; front coding stores each shared prefix only once.
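A front-coding sketch over a sorted run of terms (`front_code` and `front_decode` are illustrative names):

```python
# Front coding sketch: consecutive sorted terms share long prefixes, so each
# entry stores the shared-prefix length plus only the differing suffix.
import os.path

def front_code(sorted_terms):
    out, prev = [], ""
    for t in sorted_terms:
        p = len(os.path.commonprefix([prev, t]))
        out.append((p, t[p:]))           # (shared prefix length, suffix)
        prev = t
    return out

def front_decode(entries):
    terms, prev = [], ""
    for p, suffix in entries:
        prev = prev[:p] + suffix
        terms.append(prev)
    return terms

coded = front_code(["automata", "automate", "automatic", "automation"])
print(coded)                 # [(0, 'automata'), (7, 'e'), (7, 'ic'), (8, 'on')]
print(front_decode(coded))
```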
9. Variable Byte Encoding
Encodes each number in an integral number of bytes: 7 payload bits per byte plus a continuation bit.
10. Nibble
Half a byte (4 bits); variable-unit encodings can use nibbles instead of whole bytes.
11. Unary Code
A string of n 1s followed by a 0
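As a one-liner sketch:

```python
# Unary code: the number n is written as n 1s followed by a terminating 0.
def unary(n):
    return "1" * n + "0"

print(unary(3))  # '1110'
print(unary(0))  # '0'
```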
12. Encoding
Two types of methods: bytewise and bitwise.
Variable byte encoding is bytewise: it encodes each gap between successive docIDs in an integral number of bytes, rather than storing the docIDs themselves.
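A variable byte encode/decode sketch; the docIDs 824, 829, 215406 (gaps 824, 5, 214577) follow the worked example in Manning et al., chapter 5:

```python
# Variable byte encoding sketch: each gap is split into 7-bit chunks; the
# high (continuation) bit is set only on the last byte of each gap's code.
def vb_encode_number(n):
    bytes_ = []
    while True:
        bytes_.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_[-1] += 128                  # continuation bit terminates the code
    return bytes_

def vb_encode(gaps):
    return [b for g in gaps for b in vb_encode_number(g)]

def vb_decode(stream):
    nums, n = [], 0
    for byte in stream:
        if byte < 128:
            n = 128 * n + byte
        else:
            nums.append(128 * n + (byte - 128))
            n = 0
    return nums

docids = [824, 829, 215406]
gaps = [824, 5, 214577]                # store gaps, not raw docIDs
print(vb_decode(vb_encode(gaps)))      # [824, 5, 214577]
```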
13. Entropy
H(P) = −Σ p<sub>i</sub> log<sub>2</sub> p<sub>i</sub>; a lower bound on the average number of bits per symbol achievable by any lossless code.
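A minimal entropy sketch:

```python
# Entropy: H(P) = -sum(p * log2(p)); the lower bound, in bits per symbol,
# that any lossless code for distribution P can achieve on average.
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit -- a fair coin flip
print(entropy([1.0]))       # 0.0 -- a certain outcome carries no information
```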
14. δ Codes
Asymptotically optimal for entropy H(P) → ∞
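A sketch of Elias γ and δ codes; δ replaces γ's unary length field with a γ-coded length, which is what makes δ codes asymptotically optimal:

```python
# Elias gamma: unary code of the offset's length, then the offset (the binary
# representation of n with its leading 1 removed). Elias delta: gamma code of
# the length instead of unary, then the offset.
def unary(n):
    return "1" * n + "0"

def gamma(n):                      # n >= 1
    offset = bin(n)[3:]            # binary of n without the leading 1
    return unary(len(offset)) + offset

def delta(n):                      # n >= 1
    offset = bin(n)[3:]
    return gamma(len(offset) + 1) + offset

print(gamma(13))  # '1110101'  = unary(3) + '101'
print(delta(13))  # '11000101' = gamma(4) + '101'
```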
Purpose: to get the information that is available on a website.
Process: as described by Manning (2009, chapter 20).
Figure: The basic crawler architecture, reproduced from Figure 20.1 (Manning, 2009, chapter 20)
Static: the same prebuilt content is served each time the page is loaded.
Dynamic: content changes and can be generated on the fly.
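A toy crawler loop over an in-memory "web" (a dict of page → out-links, all hypothetical), following the frontier → fetch → parse → dedupe cycle of the basic architecture:

```python
# Toy crawler sketch: BFS over an in-memory link graph standing in for
# fetching and parsing real pages.
from collections import deque

web = {                            # hypothetical pages and their out-links
    "a.html": ["b.html", "c.html"],
    "b.html": ["a.html", "d.html"],
    "c.html": [],
    "d.html": [],
}

def crawl(seed):
    frontier, seen, order = deque([seed]), {seed}, []
    while frontier:
        url = frontier.popleft()            # fetch the next URL
        order.append(url)
        for link in web.get(url, []):       # parse: extract out-links
            if link not in seen:            # dedupe: skip URLs already seen
                seen.add(link)
                frontier.append(link)
    return order

print(crawl("a.html"))  # ['a.html', 'b.html', 'c.html', 'd.html']
```

A real crawler adds URL normalization, robots.txt checks, politeness delays, and a distributed frontier, as the architecture diagram shows.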
Boolean Retrieval vs Wildcard Queries vs Phrase Queries
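For Boolean retrieval, the core operation is intersecting sorted postings lists; a sketch of the two-pointer merge from Manning et al., chapter 1 (postings values chosen for illustration):

```python
# Boolean AND sketch: intersect two sorted postings lists in time linear
# in their combined length by advancing the pointer at the smaller docID.
def intersect(p1, p2):
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31, 45], [2, 31, 54]))  # [2, 31]
```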
Improving Score Computation and Ranking
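A minimal tf-idf scoring sketch, assuming raw term frequency and idf = log10(N/df); the three toy documents are assumptions for illustration:

```python
# tf-idf scoring sketch: score(q, d) = sum over query terms of tf * idf.
from collections import Counter
from math import log10

docs = {1: "see spot run", 2: "see bob", 3: "run spot run"}
N = len(docs)
tf = {d: Counter(text.split()) for d, text in docs.items()}   # term freq per doc
df = Counter(t for counts in tf.values() for t in counts)     # document freq

def score(query):
    scores = Counter()
    for term in query.split():
        if df[term] == 0:
            continue                       # term absent from the collection
        idf = log10(N / df[term])
        for d, counts in tf.items():
            scores[d] += counts[term] * idf
    return scores.most_common()            # ranked (doc, score) pairs

print(score("spot run"))                   # doc 3 ranks highest
```

A full ranked-retrieval system would normalize by document length (cosine similarity) and use per-term postings rather than scoring every document.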
Manning, C.D., Raghavan, P., & Schütze, H. (2009). Introduction to Information Retrieval (Online ed.). Cambridge, UK: Cambridge University Press. Available at http://nlp.stanford.edu/IR-book/information-retrieval-book.html