IIR section 1.3:
1. Intersection: intersect postings lists to quickly find documents that contain both terms, comparing the docIDs pointed to by the two pointers and advancing the pointer with the smaller docID.
2. Query optimization: the process of selecting how to organize the work of answering a query so that the least total amount of work needs to be done by the system.
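A minimal Python sketch of this two-pointer intersection (assuming postings are plain sorted lists of docIDs; the function name is my own):

```python
def intersect(p1, p2):
    """Merge-style intersection of two sorted postings lists of docIDs."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # docID in both lists: keep it
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance the pointer with the smaller docID
            i += 1
        else:
            j += 1
    return answer
```

For an AND query over several terms, query optimization means intersecting the shortest lists first, since the running time is linear in the lengths of the lists being merged.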
IIR section 1.4:
1. Ranked retrieval models: in the vector space model, users largely issue free text queries, typing one or more words rather than using a precise language with operators for building up query expressions.
2. A Boolean model only records term presence or absence. To rank documents we additionally need term frequency information in postings lists.
IIR chapter 6
1. A score with respect to the query at hand: we need a way to index and retrieve documents and a simple means of scoring documents in response to a query.
2. Metadata: specific forms of data about a document, such as its author, title, and date.
Parametric indexes: inverted indexes over metadata fields (parameters), whose postings are merged like those of a standard inverted index.
Zones: similar to fields, except that the contents of a zone can be arbitrary free text.
2.1 weighted zone scoring, 2.2 learning weights, 2.3 the optimal weight g
3. Term frequency and weighting: so far, scoring has hinged on whether or not a query term is present in a zone within a document; but not all words in a document are equally important.
3.1 inverse document frequency, 3.2 Tf-idf weighting
4. The vector space model for scoring: captures the relative importance of the terms in a document. Dot products, queries as vectors, and computing vector scores are the building blocks of the vector space model for scoring documents.
5. Variant tf-idf functions: sublinear tf scaling, maximum tf normalization, document and query weighting schemes and pivoted normalized document length.
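The tf-idf weighting and cosine scoring described above can be sketched as a toy in-memory version (the helper names and the small example collection are my own; real systems precompute document lengths and use inverted indexes rather than rescoring every document):

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, df, N):
    """Sparse tf-idf vector (term -> weight) for one document.
    Uses sublinear tf: weight = (1 + log10 tf) * log10(N / df)."""
    tf = Counter(doc_tokens)
    return {t: (1 + math.log10(tf[t])) * math.log10(N / df[t]) for t in tf}

def cosine(v, w):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(v[t] * w[t] for t in v if t in w)
    nv = math.sqrt(sum(x * x for x in v.values()))
    nw = math.sqrt(sum(x * x for x in w.values()))
    return dot / (nv * nw) if nv and nw else 0.0
```

Documents that share more high-idf terms with the query score higher; a document sharing no query terms scores 0.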
Friday, January 31, 2014
Week 4—— Muddiest Points
This week covers the construction and compression of the index.
About inverted index:
1. It seems we must build postings to store counts and positions of terms, which costs time and space; how do we compare this cost with not using an inverted index at all?
2. What data structure should postings use, and how do we deal with new words?
3. In which situations should we choose a hash table versus a B-tree as the access method?
Friday, January 24, 2014
Week 3——Reading notes (unit 3, IIR chapters 4 and 5)
IIR chapter 4: index construction
1. For building an IR system, many decisions are based on the characteristics of the computer hardware on which the system runs.
1.1 Caching: to keep frequently used disk data in main memory.
1.2 Seek time: the time for the disk head to move to the part of the disk where the data are located. Key point: no data are transferred during the seek, so reading data in large contiguous chunks saves time.
1.3 Buffer: the part of main memory where a block being read or written is stored. Data are transferred over the system bus; storing compressed data on disk can further reduce transfer time.
2. Blocked sort-based indexing: the term is the dominant key and docID is the secondary key.
2.1 termID: compile the vocabulary in the first pass and construct the inverted index in the second pass.
2.2 Reuters-RCV1: we are only concerned with the text. Reuters-RCV1 has 100 million tokens; 4 bytes each are used for termID and docID.
Blocked sort-based indexing algorithm: parses documents into termID-docID pairs and accumulates the pairs in memory until a block of a fixed size is full.
Inversion: the first step is to sort the termID-docID pairs. The second step is to collect all termID-docID pairs with the same termID into a postings (docID) list. The final step is to simultaneously merge the ten blocks into one large merged index.
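An in-memory sketch of the inversion and merge steps (real BSBI writes each inverted block to disk and streams the final merge; the helper names are my own):

```python
import heapq
from itertools import groupby

def invert_block(pairs):
    """Sort termID-docID pairs, then group pairs with the same termID
    into a (termID, sorted postings list) entry."""
    pairs.sort()
    return [(tid, sorted({d for _, d in grp}))
            for tid, grp in groupby(pairs, key=lambda p: p[0])]

def merge_blocks(blocks):
    """Merge several inverted blocks into one index (termID -> postings).
    heapq.merge streams the pre-sorted blocks, as a disk merge would."""
    index = {}
    for tid, postings in heapq.merge(*blocks):
        index[tid] = sorted(set(index.get(tid, [])) | set(postings))
    return index
```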
3. SPIMI: uses terms instead of termIDs, writes each block's dictionary to disk, and starts a new dictionary for the next block.
4. Difference between BSBI and SPIMI: SPIMI adds a posting directly to its postings list, which is faster and saves memory because no sorting of intermediate termID-docID pairs is required. The overall memory requirements for the dynamically constructed index of a block are lower in SPIMI than in BSBI, and SPIMI can compress both postings and dictionary before writing them to disk. The time complexity of BSBI is Θ(T log T), while SPIMI is Θ(T).
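A toy sketch of the SPIMI idea of adding postings directly to a growing per-block dictionary keyed by term (no termIDs, no sorting of intermediate pairs; the function name is my own, and the real algorithm also writes each completed block to disk):

```python
def spimi_invert(token_stream):
    """Single pass over (term, docID) tokens: each posting is appended
    directly to its term's postings list in the block dictionary."""
    dictionary = {}
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])
        if not postings or postings[-1] != doc_id:   # avoid duplicate docIDs
            postings.append(doc_id)
    # terms are sorted only once, when the block is written out
    return {t: dictionary[t] for t in sorted(dictionary)}
```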
5. Distributed indexing: use large computer clusters for web indexing, building a term-partitioned index. The method is an application of MapReduce, which divides the work into chunks that are easy to assign and reassign, offering a robust and conceptually simple framework for implementing index construction.
5.1 Split: split the computing job into chunks that can each be processed in a short time.
5.2 Map phase: map splits of the input data to key-value pairs.
5.3 Inverter: collect all values for a given key into one list.
6. Dynamic indexing: either periodically reconstruct the index from scratch, or maintain two indexes. Logarithmic merging: each token is initially added to the in-memory index Z0 by LMergeAddToken, and indexes are merged in a logarithmic cascade as they fill. With periodic reconstruction, a new index is built from scratch, query processing is switched to the new index, and the old index is deleted.
7. In ranked retrieval, postings are often ordered according to weight or impact, with the highest-weighted postings occurring first.
8. Security and user authorization: represent each document as the set of users that can access it, then invert the result into a user-document matrix.
IIR Chapter 5: index compression
1. Benefits of compression: to increase use of caching; faster transfer of data from disk to memory.
2. The rule of 30: the 30 most common words account for 30% of the tokens in written text. Lossless compression means that all information is preserved.
3. Heaps' law: estimates the number of terms as M = kT^b, where M is the vocabulary size and T is the number of tokens.
Zipf's law: models the distribution of terms: if t1 is the most common term in the collection and t2 the next most common, then the collection frequency cf_i of the i-th most common term is proportional to 1/i.
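A quick check of the two laws in Python (the constants k = 44 and b = 0.49 are the values the textbook reports fitting Reuters-RCV1; the function names are my own):

```python
def heaps_vocabulary(T, k=44, b=0.49):
    """Heaps' law: estimated vocabulary size M = k * T**b."""
    return k * T ** b

def zipf_cf(cf1, i):
    """Zipf's law: collection frequency of the i-th most common term,
    given the frequency cf1 of the most common term (cf_i proportional to 1/i)."""
    return cf1 / i
```

For example, with these constants the predicted vocabulary for roughly the first million tokens is around 38,000 terms.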
4. Dictionary compression: the goal is to fit the dictionary, or at least a large portion of it, in main memory to support high query throughput.
4.1 Dictionary as a string: store the dictionary terms as one long string of characters.
4.2 Blocked storage: group terms in the string into blocks of size k and keep a term pointer only for the first term of each block. Increasing k yields better compression. We can also partition the dictionary onto pages stored on disk and index the first term of each page with a B-tree.
5. Postings file compression: the goal is an encoding that uses far fewer than 20 bits per document, achieved by encoding the gaps between successive docIDs rather than the docIDs themselves.
5.1 Variable byte codes: use an integral number of bytes to encode a gap, giving a more than 50% reduction in the size of the uncompressed index.
5.2 γ codes: split the representation of a gap G into a pair of length and offset. A γ code is decoded by first reading the unary code up to the 0 that terminates it. Entropy gives a lower bound on achievable code length: the larger the entropy, the less compressible the data. We can calculate the space requirements for an inverted index compressed with γ encoding.
γ codes achieve great compression ratios: about 15% better than variable byte codes for Reuters-RCV1.
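Sketches of the two gap encodings (encoder side only, function names my own). Variable byte packs 7 payload bits per byte with the high bit marking the last byte; the γ code is the unary length of the offset, a 0 separator, then the offset (the binary form of the gap with its leading 1 removed):

```python
def vb_encode(n):
    """Variable byte code for one gap: list of byte values,
    7 payload bits each, continuation bit (128) set on the last byte."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128
    return out

def gamma_encode(g):
    """Gamma code for a gap g >= 1, as a bit string:
    unary(length of offset) + '0' + offset."""
    offset = bin(g)[3:]              # binary of g with the leading 1 dropped
    return "1" * len(offset) + "0" + offset
```

For example the gap 824 becomes the two bytes 6 and 184 under VB, and 13 becomes 1110101 under γ (unary 111, separator 0, offset 101).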
Friday, January 17, 2014
week2——Muddiest Points
This week I learned about methods for document and query processing. However, I have taken natural language processing, so I am confused about the differences in methods and processing between these two topics.
1. About N-gram:
When using n-grams, such as bigrams and trigrams, to scan characters, is it right that in IR we simply stop once the scanned term matches a keyword? Do we not need to calculate the probability of each observed n-gram?
2. About stopword:
Different languages have different stopwords. I suspect they take space to store and time to recognize across languages. And, as noted, they sometimes lead to mistakes with certain special phrases, so is it necessary to use them?
3. I would like to know the meaning of recall in stemming, and I didn't follow the two points about why we use stemming.
4. I wonder why WSD doesn't work. I think it may not be useful in a keyword search engine, because recognizing the meaning of words or phrases is not necessary there; but I think it would work when the keyword has several different meanings. Is that right?
Friday, January 10, 2014
Week1——Reading notes-MIR(unit1)
1. The developments of IR.
2. The IR problem: an information need, translating the need into a query, retrieving useful and relevant information, and ranking the information items according to their degree of relevance.
3. How the web changed search: document collection, performance and scalability, large collections, a medium of business, web spam.
Week1——Reading notes(Unit 2)
IIR Sections 1.2:
It introduced the process of building an inverted index. Each document has a unique docID, and the core step is sorting. Terms are alphabetized, and identical terms are split into a dictionary (in memory) and postings (on disk) in order to reduce storage space. DocIDs are used to sort postings for efficient query processing.
Optimizations for storage and access efficiency:
1. Singly linked lists: easy insertion of documents and support for more advanced indexing strategies.
2. Variable length arrays: use memory caches to increase speed and avoid the overhead of pointers.
IIR Chapter 2: the term vocabulary and postings lists
1. It is important to use machine learning to decide how to decode natural language into a character sequence. We should store the choice of encoding and document format. I think the problem of choosing document units must be handled case by case for different formats and languages.
2. Token: an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.
Type: the class of all tokens containing the same character sequence.
Term: a type that is included in the IR system's dictionary.
3. Different languages have different tokenization rules, referring not only to natural languages but also to specialized languages, such as those of computer technology. One efficient strategy is to let users train the system.
4. Problem: the author only gives examples of tokenization mistakes in a few languages. Are there common rules and refined methods to resolve such mistakes, or can we only avoid them by human judgment?
5. Stop words: common words with little value for matching user needs. Using a stop list reduces the number of postings but may miss some query keywords; therefore, web search engines generally do not use stop lists.
6. Token normalization: the process of canonicalizing tokens. The method is to implicitly create equivalence classes. There are many approaches for normalizing a single language and multiple languages.
7. Porter's algorithm: includes 5 phases for stemming English. The author provides details of the process with examples.
8. Skip pointers: shortcuts in postings lists. They make merging efficient and help with AND queries, but are not useful for OR queries. The author shows how to place skip pointers.
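A sketch of postings intersection with skip pointers (here skips are simulated by jumping √len positions in a plain list rather than storing explicit pointers; names are my own):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, using evenly spaced skips of
    about sqrt(len) to jump over runs of docIDs that cannot match."""
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # follow the skip only if its target does not overshoot p2[j]
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer
```

Skips only help AND queries: an OR query must visit every posting anyway, so there is nothing to skip.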
9. Biword indexes: treat every pair of consecutive terms in a document as a phrase. We can group terms into nouns to form an extended biword.
IIR Chapter 3: Dictionaries and tolerant retrieval
1. Search structures: hashing and search trees (B-trees).
2. Wildcard queries: the situations where they are used; a combined B-tree and reverse B-tree. Permuterm indexes: introduce a special symbol as an end mark, then build a permuterm index in which the various rotations of each term all link back to the original vocabulary term. K-gram indexes: index sequences of k characters. However, such indexes can be difficult and expensive to process.
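A small sketch of a permuterm index handling a single-* wildcard (a dictionary-backed toy; the real structure would be a B-tree over the rotations, and all names here are my own):

```python
def permuterm_rotations(term):
    """All rotations of term + '$' end mark, each to be linked back to term."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def build_permuterm(vocabulary):
    """Map every rotation of every vocabulary term to the original term."""
    index = {}
    for term in vocabulary:
        for rot in permuterm_rotations(term):
            index[rot] = term
    return index

def wildcard_lookup(index, query):
    """Answer a query with one '*': rotate it so the '*' lands at the end
    (X*Y -> Y$X*), then prefix-match against the rotations."""
    q = query + "$"
    star = q.index("*")
    rotated = q[star + 1:] + q[:star]
    return sorted({t for rot, t in index.items() if rot.startswith(rotated)})
```

For example the query m*n is rotated to n$m, whose prefix matches the rotations of both "man" and "moon".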
3. Spelling correction algorithm: based on a notion of nearness or proximity between a pair of queries; when two correctly spelled candidates are tied, select the one that is more common.
4. Isolated-term correction: correct a single query term at a time (the methods are edit distance and k-gram overlap). Edit distance: the minimum number of edit operations required to transform character string s1 into s2. K-gram overlap: scan the postings for all k-grams in the query q.
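The edit distance can be computed with the standard dynamic-programming recurrence (a generic Levenshtein implementation with unit costs, rather than the book's exact pseudocode):

```python
def edit_distance(s1, s2):
    """Minimum number of insertions, deletions, and substitutions
    needed to transform s1 into s2 (Levenshtein distance)."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                     # delete all of s1[:i]
    for j in range(n + 1):
        d[0][j] = j                     # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[m][n]
```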
5. Context-sensitive spelling correction: enumerate corrections of each of the query terms even though each term is correctly spelled, then try substituting each correction into the phrase.
6. Phonetic correction: handles misspellings that arise because the user types a query that sounds like the target term. The idea is to compute a phonetic hash so that similar-sounding terms map to the same value.
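As a concrete example of phonetic hashing, here is a sketch of the classic Soundex algorithm (one common variant; details differ slightly between Soundex versions):

```python
# Consonant classes: letters in the same class get the same digit;
# vowels and h/w/y map to "0" and are dropped after collapsing.
SOUNDEX_MAP = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
               **dict.fromkeys("dt", "3"), "l": "4",
               **dict.fromkeys("mn", "5"), "r": "6"}

def soundex(term):
    """4-character Soundex hash: first letter + three consonant-class digits."""
    term = term.lower()
    first = term[0].upper()
    digits = [SOUNDEX_MAP.get(c, "0") for c in term]
    # collapse runs of identical digits, then drop the zeros
    collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    code = "".join(d for d in collapsed[1:] if d != "0")
    return (first + code + "000")[:4]
```

Similar-sounding names hash to the same code, so a query term can be matched against all vocabulary terms sharing its hash.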
Week1——Reading notes-IES(unit1)
1. In section 1.1, the author briefly explains the definition of information retrieval: representing, searching and manipulating large collections of electronic text and other human-language data.
2. Web search:
(1) web crawler: stores snapshots of web pages for accurate results.
(2) ranking algorithm: includes many features for ranking links, with continued refinement of those features.
3. IR systems: be familiar with the architecture of an IR system. A user's information need drives the search: the user issues a query to the system, the search engine computes an RSV (retrieval status value) for each document, and the system outputs result lists.
4. Evaluating the performance of an IR system is necessary and important; it concerns efficiency (time and space) and effectiveness (human judgements).
Week1——Reading notes-FOA(unit1)
FOA means "finding out about": the research and cognitive activities of decision-makers using modern distributed networks of knowledge.
Here are some notes on each part of the process:
1. Asking a question:
We only know the details of users' questions, such as topic, time and other information, but not their knowledge structures or what answers they clearly want. This is what we must design the search engine around.
2. Constructing an answer
Retrieve good answers from the corpus and refine the algorithm to produce better answers.
3. Assessing the answer
This closes the loop: it allows users to carry on a dialogue with the search engine, using relevance feedback to constrain the questions in order to get better answers.
Thursday, January 9, 2014
Week1—— Muddiest Points
1. Big data
During the introduction of big data, I was confused about the value of big data. Does it mean the actual value of the data sets, or the meaningful value gained from using big data?
2. Unstructured information
For unstructured information, how do we extract practical information from images or videos? And what are the differences between this kind of information retrieval processing and data mining?
3. Taylor's model
I missed the point about which two stages have a gap between them; does the formalized stage fill the gap?