Thursday, April 17, 2014

Week14——Reading notes(unit 14)

Information retrieval on the Semantic Web

They focus on three scenarios: IR, Q&A, and complex question answering. They utilise a set of ontologies and a hybrid information retrieval mechanism to resolve the conflict between the person's and the machine's views of the query. They also use precision and recall to evaluate their system's performance. Their future work includes a more sophisticated search engine that incorporates user feedback to improve accuracy.

week14——Muddiest points(unit 13)

Unit 13 Text classification and clustering

About classification: before classifying text, do we need to do semantic and grammatical analysis? Can we use the same smoothing methods as in language models?
How do we compare the performance of each method in different cases?

Friday, April 11, 2014

Week13——Reading notes(unit 13)

Chapter 13. Text classification and Naive Bayes
1. To achieve good recall, standing queries thus have to be refined over time and can gradually become quite complex. Given a set of classes, we seek to determine which class(es) a given object belongs to.

2. The classification task is called text classification, text categorisation, topic classification or topic spotting.

3. Naive Bayes is robust, meaning that it can be applied to many different learning problems and is unlikely to produce classifiers that fail catastrophically. It is a probabilistic learning method. To eliminate zeros, we use add-one or Laplace smoothing, which simply adds one to each count.
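A minimal sketch of multinomial Naive Bayes with add-one smoothing, using the classic toy China/Japan training set from IIR Chapter 13 (everything here is illustrative, not a full implementation):

```python
from collections import Counter, defaultdict
import math

# Toy multinomial Naive Bayes with add-one (Laplace) smoothing.
train = [
    ("chinese beijing chinese", "yes"),
    ("chinese chinese shanghai", "yes"),
    ("chinese macao", "yes"),
    ("tokyo japan chinese", "no"),
]

class_docs = defaultdict(list)
for text, cls in train:
    class_docs[cls].append(text)

vocab = {t for text, _ in train for t in text.split()}
priors, cond = {}, {}
for cls, docs in class_docs.items():
    priors[cls] = len(docs) / len(train)
    counts = Counter(t for d in docs for t in d.split())
    total = sum(counts.values())
    # add-one smoothing: every term's count is incremented by one
    cond[cls] = {t: (counts[t] + 1) / (total + len(vocab)) for t in vocab}

def classify(text):
    # sum log probabilities to avoid floating-point underflow
    scores = {
        cls: math.log(priors[cls])
        + sum(math.log(cond[cls][t]) for t in text.split() if t in vocab)
        for cls in priors
    }
    return max(scores, key=scores.get)

print(classify("chinese chinese chinese tokyo japan"))  # "yes"
```

Without the add-one step, the zero counts of "tokyo" and "japan" in the "yes" class would force that class's probability to zero.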

4. Positional independence: the conditional probabilities for a term are the same independent of position in the document.

5. Feature selection for multiple classifiers: it is desirable to select a single set of features instead of a different one for each classifier. Feature selection statistics are first computed separately for each class on the two-class classification task c versus its complement, and then combined.

Chapter 14. Vector space classification
Contiguity hypothesis: documents in the same class form a contiguous region and regions of different classes do not overlap. Knn or k nearest neighbour classification assigns the majority class of the k nearest neighbours to a test document. Knn requires no explicit training and can use the unprocessed training set directly in classification.
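As a quick illustration of why kNN needs no explicit training beyond storing the labelled examples, a sketch over made-up 2-D points:

```python
from collections import Counter

# Tiny kNN classifier over 2-D points; the training data is invented.
train = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"),
         ((4.0, 4.0), "B"), ((4.2, 3.8), "B"), ((3.9, 4.1), "B")]

def knn(x, k=3):
    # sort the stored examples by squared Euclidean distance to x
    nearest = sorted(train,
                     key=lambda p: (p[0][0] - x[0]) ** 2 + (p[0][1] - x[1]) ** 2)
    # assign the majority class of the k nearest neighbours
    votes = Counter(cls for _, cls in nearest[:k])
    return votes.most_common(1)[0][0]

print(knn((3.8, 4.0)))  # "B": its nearest neighbours are mostly class B
```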

Week13——Muddiest points(unit 12)

Unit 12 Intelligent information retrieval

No questions of this unit.

Friday, April 4, 2014

Week12——Reading notes(unit 11)

User Profiles for Personalized Information Access

1. User profiling refers to the popular techniques used for collecting information about users and for representing and building user profiles.

2. Collecting information about users: the information collected may be explicitly input by the user or implicitly gathered by a software agent, collected on the user's client machine or gathered by the application server itself.

3. User profile representations: keyword profiles and semantic network profile.

4. User profile construction: building keyword profiles and semantic network profile, then building concept profiles.

Content-Based Recommendation Systems

1. Systems recommend an item to a user based upon a description of the item and a profile of the user's interests.

2. Items are stored in a database (item representation), along with information about user profiles. We then build a user model, including the user's preferences learned from their history. Decision trees, built by recursively partitioning the training data, and other classification methods can be used. We also need relevance feedback to evaluate accuracy.

week12——Muddiest points

Unit 12: MLIA and Parallel IR

1. About bilingual dictionaries:
I wonder whether the MRD exists separately for each language pair or is generated dynamically depending on the query, and where and how to store it. Another problem is how it judges which meaning of a word applies in the context.

2. About bilingual corpora:
What are the differences between parallel corpora and comparable corpora?

3. About translation disambiguation:
How do we identify a phrase, or segment words into phrases?

Friday, March 28, 2014

Week11——Reading Notes(unit 11)

Week11——Muddiest Points

Unit10 Web Information Retrieval

1. About inverted indexes for web search engines: web search engines need very large inverted indexes, so how to store them and how much storage space they require are serious problems.

2. About web crawlers: what is the exact process of parsing the pages? How should we deal with web crawling challenges, such as hardware problems?

Friday, March 21, 2014

Week10——Reading notes(Unit 10)

Chapter 19 Web search basics
1. Web characteristics: web page authors created content in dozens of languages and thousands of dialects, thus demanding many different forms of stemming and other linguistic operations.
     The web graph: the static web, consisting of static HTML pages together with the hyperlinks between them, can be viewed as a directed graph in which each web page is a node and each hyperlink a directed edge.
     Spam: the manipulation of web page content for the purpose of appearing high up in search results for selected keywords.
 
2. Advertising as the economic model: convey to the viewer a positive feeling about the brand of the company placing the advertisement. (sponsored search or search advertising)

3. Search user experience:

Friday, February 28, 2014

Week8——Reading notes(MIR chapter10)

MIR chapter10 User interfaces and visualisation
1. Human-Computer Interaction:
    Information access(design principles): provide informative feedback, permit easy reversal of actions, support an internal locus of control, reduce working memory load, and provide alternative interfaces for novice and expert users.
    Information visualisation: to provide visual depictions of very large information spaces, using icons and colour highlighting, brushing and linking, panning and zooming, focus-plus-context, magic lenses and animation to retain context and help make occluded information visible.

2. Evaluate interactive system:
    Precision and recall measures have been widely used for comparing the ranking results of non-interactive systems, but are less appropriate for assessing interactive systems.

3. Information access process: an interaction cycle consisting of query specification, receipt and examination of retrieval results and then either stopping or reformulating the query and repeating the process until a perfect result set is found. How to reformulate the query?

4. Starting points: lists, overviews, and automated source selection.

5. Query specification: command language, form fill-in, menu selection, direct manipulation, and natural language; also Boolean query specification.

Week8——Muddiest Points(unit 7)

Unit7: relevance feedback& query expansion

1. How to decide the value of trade-off between recall and precision?

2. Evaluate ranked retrieval:
    Average values may hide different performances, so how do we judge whether a system is stable or not?
I am also confused about significance tests.

3. Relevance feedback:
    About BM25: it is a more complicated feedback model, so does it verify the information need, and how are feedback scores calculated based on experience?

Friday, February 21, 2014

Week7——Reading Notes(unit7, IIR chapter9)

Chapter 9 Relevance feedback and query expansion:

1. A system can help with query refinement, either fully automatically or with the user in the loop.

2. Relevance feedback: it can also be effective in tracking a user's evolving information need: seeing some documents may lead users to refine their understanding of the information they are seeking, and they can easily indicate relevant or non-relevant images. It can improve both recall and precision.

3. The Rocchio algorithm: it models a way of incorporating relevance feedback information into the vector space model. The underlying idea is to move the query toward the centroid of the relevant documents and away from the centroid of the non-relevant documents. The algorithm proposes using the modified query.
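A sketch of the modified Rocchio query, q_m = α·q0 + β·centroid(relevant) − γ·centroid(non-relevant). The weights 1.0/0.75/0.15 are commonly cited defaults and the term-weight vectors are invented for illustration:

```python
# Rocchio relevance feedback on toy term-weight vectors (pure Python).
alpha, beta, gamma = 1.0, 0.75, 0.15  # commonly used default weights
q0 = [1.0, 0.0, 0.0]                  # original query vector
relevant = [[0.9, 0.8, 0.0], [0.7, 0.6, 0.1]]
nonrelevant = [[0.0, 0.1, 0.9]]

def centroid(vecs):
    # component-wise mean of a list of vectors
    return [sum(col) / len(vecs) for col in zip(*vecs)]

rel_c, nonrel_c = centroid(relevant), centroid(nonrelevant)
q_m = [max(0.0, alpha * q + beta * r - gamma * n)  # negative weights clipped to 0
       for q, r, n in zip(q0, rel_c, nonrel_c)]
print(q_m)  # the query moves toward relevant docs, away from non-relevant ones
```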

4. Naive Bayes probabilistic model: uses only collection statistics and information about the term distribution within the documents judged relevant. It preserves no memory of the original query.

5. Cases where relevance feedback alone is not sufficient include: misspellings, cross-language information retrieval and mismatch of searcher's vocabulary versus collection vocabulary.

Week7——Muddiest Points

Unit6: Evaluation of IR Systems:

1. Important issues in evaluation:

    The professor mentioned that in IR, evaluation is a comparative study, so how can we set up the baseline and the judgement between models?

2. The pooling method is cheaper and more powerful; why do we choose to pick the top 50 results from each system to put together as a pool?

Thursday, February 13, 2014

Week6——Reading notes(unit 6)

Chapter 8 Evaluation in information retrieval

1. Information retrieval system evaluation:
    Relevance is assessed relative to an information need, not a query. How can we judge the information need if the query does not contain all the relevant words?

2. Standard test collections:
    the Cranfield collection, TREC, GOV2, NTCIR, CLEF, Reuters, 20 Newsgroups. How do we decide which standard to choose in different situations?

3. Evaluation of unranked retrieval sets:
   Precision: the fraction of retrieved documents that are relevant.
   Recall: the fraction of relevant documents that are retrieved.
   F measure: trades off precision versus recall; it is the weighted harmonic mean of precision and recall.
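The three measures above can be computed directly from the retrieved and relevant sets; the docID sets below are invented for illustration:

```python
# Precision, recall and F-measure from retrieved/relevant sets (toy data).
retrieved = {1, 2, 3, 4, 5}
relevant = {3, 4, 5, 6, 7, 8}

tp = len(retrieved & relevant)                       # true positives
precision = tp / len(retrieved)                      # 3/5 = 0.6
recall = tp / len(relevant)                          # 3/6 = 0.5
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean (balanced F)
print(precision, recall, round(f1, 3))
```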

4. Evaluation of ranked retrieval results:
    Mean Average Precision: provide a single-figure measure of quality across recall levels.
    ROC curve: plot the true positive rate or sensitivity against the false positive rate or (1- specificity).

5. Assessing relevance:
    It is a time-consuming and expensive process involving human beings.
    Pooling: relevance is assessed over a subset of the collection that is formed from the top k documents returned by a number of different IR systems and perhaps other sources such as the results of Boolean keyword searches or documents found by expert searchers in an interactive process.
    The relevance of one document is treated as independent of the relevance of other documents in the collection.
 

week6——Muddiest Points

Unit 5: Probabilistic Retrieval Models:

1. About language models:

    In IR systems, why do we only consider unigrams, not n-grams? Does it mean that one string or one word depends on the string or word before it? How do we deal with new words or new languages?

2. About smoothing:

    Add-one: I think it is only used to eliminate zero probabilities; some strings or words still have very low probabilities.
    Linear interpolation: how do we decide the value of λ?
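For the linear-interpolation question above, a sketch of Jelinek-Mercer smoothing: λ mixes the document model with the collection model, and its value (0.5 here, an arbitrary choice) is normally tuned empirically on held-out data, which is exactly the open question:

```python
# Jelinek-Mercer (linear interpolation) smoothing sketch; toy texts.
doc = "the quick brown fox".split()
collection = "the quick brown fox jumps over the lazy dog".split()
lam = 0.5  # assumed value for illustration; tuned in practice

def p_smoothed(term):
    p_doc = doc.count(term) / len(doc)                # document model
    p_col = collection.count(term) / len(collection)  # collection model
    return lam * p_doc + (1 - lam) * p_col

print(p_smoothed("dog"))  # nonzero even though "dog" never occurs in the doc
```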

3. LM vs VS:

   LMs depend on strings and words, so will they require a lot of space to store probabilities, and do they not consider the whole document?

Friday, January 31, 2014

Week 4——Reading notes(unit 4, IIR section 1.3 &1.4 and chapter 6)

IIR section 1.3:
1. Intersection: intersect postings lists to quickly find documents that contain both terms; compare the docIDs pointed to by both pointers.
2. Query optimization: the process of selecting how to organize the work of answering a query so that the least total amount of work needs to be done by the system.
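The intersection in item 1 is the classic merge of two sorted docID postings lists; a sketch, following the INTERSECT algorithm in IIR:

```python
# Merge-style intersection of two sorted docID postings lists.
def intersect(p1, p2):
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # both lists contain this docID
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance the pointer with the smaller docID
            i += 1
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31, 45], [2, 31, 54, 101]))  # [2, 31]
```

Each list is scanned at most once, so the merge runs in time linear in the total length of the two lists.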

IIR section 1.4:
1. Ranked retrieval models: in the vector space model, users largely use free text queries, just typing one or more words rather than using a precise language with operators for building up query expressions.
2. A Boolean model only records term presence or absence; for ranking we need term frequency information in postings lists.

IIR chapter 6
1. A score with respect to the query at hand: we need to index and retrieve documents, plus a simple means for scoring documents in response to a query.
2. Metadata: it means specific forms of data about a document.
    Parametric indexes: merge postings from standard inverted indexes.
    Zones: be similar to fields, except the contents of a zone can be arbitrary free text.
    2.1 weighted zone scoring, 2.2 learning weights, 2.3 the optimal weight g
3. Term frequency and weighting: scoring has hinged on whether or not a query term is present in a zone within a document. All words in a document are not equally important.
    3.1 inverse document frequency, 3.2 Tf-idf weighting
4. The vector space model for scoring: captures the relative importance of the terms in a document. Dot products, queries as vectors, computing vector scores for building the vector space model to score the documents.
5. Variant tf-idf functions: sublinear tf scaling, maximum tf normalization, document and query weighting schemes and pivoted normalized document length.
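The tf-idf weighting in item 3.2 can be sketched as w(t, d) = tf(t, d) × log(N/df(t)); the collection size and document frequencies below are invented for illustration:

```python
import math

# tf-idf weighting sketch: weight = tf * log10(N / df).
N = 1_000_000                                          # total documents (assumed)
df = {"car": 18_165, "insurance": 19_241, "best": 25_235}   # doc frequencies
tf = {"car": 27, "insurance": 3, "best": 14}                # counts in one doc

weights = {t: tf[t] * math.log10(N / df[t]) for t in tf}
for t, w in weights.items():
    print(t, round(w, 2))
```

A term that is frequent in the document but rare in the collection ("car" here) receives the highest weight, which is exactly the intuition the chapter describes.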

Week 4—— Muddiest Points

This week talks about the construction and compression of indexes.
 About inverted index:
   1. It looks like we should build postings to store counts and positions of terms, which costs time and space; how do we compare this cost with not using an inverted index?
   2. What about the data structure of postings, and how do we deal with new words?
   3. In which situations should we choose a hash table or a B-tree as the access method?


 

Friday, January 24, 2014

Week3——Reading notes(unit 3,IIR chapter 4 and 5)

IIR chapter 4: index construction

1. For building an IR system, many decisions are based on the characteristics of the computer hardware on which the system runs.
     1.1 Caching: to keep frequently used disk data in main memory.
     1.2 Seek time: the time for the disk head to move to the part of the disk where the data are located. No data are transferred during the seek, so transferring one large chunk of data saves time compared with several small chunks.
     1.3 Buffer: the part of main memory where a block being read or written is stored. Data is transferred over the system bus, and storing compressed data on disk can reduce transfer time.

2. Blocked sort-based indexing: the term is the dominant key and docID is the secondary key.
     2.1 termID: compile the vocabulary in the first pass and construct the inverted index in the second pass.
     2.2 Reuters-RCV1: we are only concerned with text. Reuters-RCV1 has 100 million tokens and uses 4 bytes each for termID and docID.
           Blocked sort-based indexing algorithm: parses documents into termID-docID pairs and accumulates the pairs in memory until a block of a fixed size is full.
           Inversion: the first step is to sort the termID-docID pairs. The second step is to collect all termID-docID pairs with the same termID into a postings (docID) list. The final step is to simultaneously merge the ten blocks into one large merged index.
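The inversion steps above, sketched in memory for one small block (the termID-docID pairs are made up):

```python
from itertools import groupby

# Inversion step of blocked sort-based indexing: sort termID-docID pairs,
# then collect the docIDs for each termID into a postings list.
pairs = [(2, 1), (1, 2), (2, 2), (1, 1), (3, 2)]  # (termID, docID)

pairs.sort()  # term is the dominant key, docID the secondary key
index = {term: sorted({d for _, d in grp})
         for term, grp in groupby(pairs, key=lambda p: p[0])}
print(index)  # {1: [1, 2], 2: [1, 2], 3: [2]}
```

In real BSBI each sorted block is written to disk and the blocks are merged in a final pass; this sketch only shows the per-block sort-and-collect.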

3. SPIMI: uses terms instead of termIDs, writes each block's dictionary to disk, and starts a new dictionary for the next block.

4. Differences between BSBI and SPIMI: SPIMI adds a posting directly to its postings list, which is faster and saves memory. The overall memory requirement for the dynamically constructed index of a block in SPIMI is lower than in BSBI. SPIMI uses compression to store postings and the dictionary. The time complexity of BSBI is O(T log T) and of SPIMI is O(T).

5. Distributed indexing: use large computer clusters for web-scale indexing. Building a term-partitioned index is an application of MapReduce, which divides the work into chunks that can easily be assigned and reassigned. MapReduce offers a robust and conceptually simple framework for implementing index construction.
       5.1 Split: split up the computing job into chunks to process in a short time.
       5.2 Map phase: map splits of the input data to key-value pairs.
       5.3 inverter: collect all values for a given key into one list.

6. Dynamic indexing: periodically reconstruct the index from scratch, or maintain two indexes. Logarithmic merging: each token is initially added to the in-memory index Z0 by LMergeAddToken. Alternatively, a new index is built from scratch periodically instead of merging; query processing is then switched to the new index and the old index is deleted.

7. In ranked retrieval, postings are often ordered according to weight or impact with the highest-weighted postings occurring first.

8. Security and user authorization: represent each document as the set of users that can access it, and then invert the result into a user-document matrix.

IIR Chapter 5: index compression

1. Benefits of compression: to increase use of caching; faster transfer of data from disk to memory.

2. The rule of 30: the 30 most common words account for 30% of the tokens in written text. Lossless means that all information is preserved.

3. Heaps' law: estimates the number of terms as M = kT^b, where M is the vocabulary size and T is the number of tokens.
    Zipf's law: models the distribution of terms: if t1 is the most common term in the collection and t2 the next most common, then the collection frequency cf_i of the i-th most common term is proportional to 1/i.
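A quick numeric sketch of Heaps' law; k ≈ 44 and b ≈ 0.49 are the parameter values IIR fits for Reuters-RCV1:

```python
# Heaps' law sketch: vocabulary size M = k * T**b.
k, b = 44, 0.49  # parameters fitted for Reuters-RCV1 in IIR

for T in (10_000, 1_000_000, 100_000_000):
    # vocabulary keeps growing with collection size, but sub-linearly
    print(T, round(k * T ** b))
```

Because b < 1, doubling the collection far less than doubles the vocabulary, which is why dictionary compression can assume the dictionary is comparatively small.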

4. Dictionary compression: to fit the dictionary in main memory, or at least a large portion of it, to support high query throughput.
       4.1 String: store the dictionary terms as one long string of characters.
       4.2 Blocked storage: group terms in the string into blocks of size k and keep a term pointer only for the first term of each block. Increasing k gives better compression. We partition the dictionary into pages that are stored on disk, then index the first term of each page using a B-tree.

5. Postings file compression: one that uses fewer than 20 bits per document.
        5.1 Variable byte codes: use an integral number of bytes to encode a gap. They give more than a 50% reduction in the size of the uncompressed index.
        5.2 γ codes: split the representation of a gap G into a pair of length and offset. A γ code is decoded by first reading the unary code up to the 0 that terminates it. The entropy of the gap distribution gives a lower bound on how compactly the gaps can be encoded. We can calculate the space requirements for an inverted index compressed with γ encoding.
γ codes achieve great compression ratios: about 15% better than variable byte codes for Reuters-RCV1.
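A sketch of variable byte encoding for the gaps discussed in 5.1: seven payload bits per byte, with the high bit set on the last byte of each number, following the scheme in IIR Chapter 5:

```python
# Variable byte encoding/decoding of docID gaps.
def vb_encode_number(n):
    bytes_ = []
    while True:
        bytes_.insert(0, n % 128)  # low 7 bits first, prepended
        if n < 128:
            break
        n //= 128
    bytes_[-1] += 128              # set high bit on the final byte
    return bytes_

def vb_decode(stream):
    numbers, n = [], 0
    for byte in stream:
        if byte < 128:             # continuation byte
            n = 128 * n + byte
        else:                      # final byte of this number
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

gaps = [824, 5, 214577]
stream = [b for g in gaps for b in vb_encode_number(g)]
print(vb_decode(stream))  # [824, 5, 214577]
```

Small gaps cost a single byte, which is where the >50% space saving comes from.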

Friday, January 17, 2014

week2——Muddiest Points

This week, I learned about methods for document and query processing. However, I have taken natural language processing, so I am confused about the differences in methods and processing between these two topics.
1. About N-gram:
    Using n-grams such as bigrams and trigrams to scan each character, is it right that in IR we just stop when the term matches the keywords? Do we not need to calculate the probability of each observed n-gram?

2. About stopword:
    Different languages have different stopwords. I wonder whether they occupy space to store and take time to recognize across languages. And as it is said, sometimes they lead to mistakes in some special phrases, so is it necessary to use them?

3. I wonder about the meaning of recall in the context of stemming, and I didn't follow the two points about why we use stemming.

4. I wonder why WSD doesn't work. I think it may not be useful in keyword search engines because it is not necessary to recognize the meaning of words or phrases, but I think it would work when the keyword has different meanings. Is that right?

Friday, January 10, 2014

Week1——Reading notes-MIR(unit1)

1. The developments of IR.

2. IR problem: information need, translate the need into a query, retrieve useful and relevant information, rank the information items according to their degree of relevance.

3. Web changed search: document collection, performance and scalability, large collections, medium of business, web spam.

Week1——Reading notes(Unit 2)

IIR Section 1.2:

It introduced the process of building an inverted index. Each document has a unique docID, and the core step is sorting. Terms are sorted alphabetically, and identical terms are merged; the index is split into a dictionary (in memory) and postings (on disk) in order to reduce storage space. DocIDs help to sort postings for efficient query processing.
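The sorting-based construction described above, sketched on three tiny invented documents:

```python
# Minimal inverted index build in the spirit of IIR Section 1.2:
# tokenize, sort (term, docID) pairs, then merge into dictionary + postings.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

# the set removes duplicate (term, docID) pairs; sorting is the core step
pairs = sorted({(term, doc_id)
                for doc_id, text in docs.items()
                for term in text.split()})

index = {}
for term, doc_id in pairs:
    index.setdefault(term, []).append(doc_id)  # postings stay sorted by docID

print(index["home"])   # [1, 2, 3]
print(index["july"])   # [2, 3]
```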

Optimized for storage and access efficiency:

1. Singly linked lists: easy to insert documents and to use more advanced indexing strategies.

2. Variable length arrays: use memory caches to increase speed and avoid pointer overhead.



IIR Chapter 2: the term vocabulary and posting lists

1. It is important to use machine learning to decide how natural language is encoded in order to obtain the character sequence. We should record the choice of encoding and the document format. I think the problem of choosing document units is case by case, depending on format and language.



2. Token: an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.

    Type: the class of all tokens containing the same character sequence.

    Term: a type that is included in the IR system's dictionary.



3. Different languages have different tokenization rules, referring not only to natural languages but also to specialized vocabularies, such as in computer technology. One efficient strategy is to have users train the system.



4. Problem: the author only gave examples of tokenization mistakes in a few languages. Are there common rules and refined methods to solve these mistakes, or can we only avoid them by human judgment?



5. Stop words: common words with little value for matching user needs. Using stop lists reduces the number of postings, but queries may then miss some keywords. Therefore, web search engines generally do not use stop lists.



6. Token normalization: the process of canonicalizing tokens. The method is to implicitly create equivalence classes. There are many approaches to normalizing within a single language and across multiple languages.



7. Porter's algorithm: consists of five phases of word reductions for stemming English. The author provides details of the process and examples.



8. Skip pointers: shortcuts in postings lists. They make merging efficient and help with AND queries, but are not useful for OR queries. The author describes how to place skip pointers.



9. Biword indexes: treat every pair of consecutive terms in a document as a phrase. Consecutive nouns can be grouped into an extended biword.



IIR Chapter 3: Dictionaries and tolerant retrieval

1. Search structures: hashing and search trees (B-trees).



2. Wildcard queries: the situations where they are used; combining a B-tree and a reverse B-tree. Permuterm indexes: introduce a special symbol as an end mark, then build a permuterm index in which the various rotations of each term all link to the original vocabulary term. K-gram indexes: a k-gram is a sequence of k characters. But processing these indexes is difficult and expensive.



3. Spelling correction algorithm: a notion of nearness or proximity between a pair of queries; when two correctly spelled candidates are tied, select the one that is more common.



4. Isolated-term correction: correct a single query term at a time (the methods are edit distance and k-gram overlap). Edit distance: the minimum number of edit operations required to transform character string s1 into s2. K-gram overlap: scan the postings for all k-grams in the query q.
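The standard dynamic-programming computation of edit distance, as a sketch (insertion, deletion and substitution each cost 1):

```python
# Levenshtein edit distance by dynamic programming.
def edit_distance(s1, s2):
    m, n = len(s1), len(s2)
    # d[i][j] = distance between the first i chars of s1 and first j of s2
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                 # delete all of s1's prefix
    for j in range(n + 1):
        d[0][j] = j                 # insert all of s2's prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution (or match)
    return d[m][n]

print(edit_distance("fast", "cats"))  # 3
```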



5. Context-sensitive spelling correction: enumerate corrections of each of the query terms, even though each query term may be correctly spelled, then try substitutions of each correction in the phrase.



6. Phonetic correction: misspellings that arise because the user types a query that sounds like the target term. It is to build a phonetic hash for similar-sounding terms as the same value.



Week1——Reading notes-IES(unit1)

1. In section 1.1, the author briefly explains the definition of information retrieval: representing, searching and manipulating large collections of electronic text and other human-language data.

2. Web search:
    (1) Web crawler: stores snapshots of web pages for accurate results.
    (2) Ranking algorithm: includes many features to rank links, with the features continually refined.

3. IR systems: be familiar with the architecture of an IR system. Users with an information need conduct a search by issuing a query to the system; the search engine computes retrieval status values (RSVs) and then outputs ranked result lists.

4. Evaluating the performance of an IR system is necessary and important; it concerns efficiency (time and space) and effectiveness (human judgements).

Week1——Reading notes-FOA(unit1)

FOA means "finding out about": the cognitive research activities of decision-makers using modern distributed networks of knowledge.

There are some notes on each part of the process:
1. Asking a question:
   We only know the details of users' questions, such as topic, time and other information, but not their knowledge structures or exactly what answers they want. This is what we need to design the search engine around.

2. Constructing an answer
    Retrieve good answers from the corpus and refine the algorithm to produce better answers.

3. Assessing the answer
    It closes the loop by allowing users to hold a dialogue with the search engine, using relevance feedback to constrain the questions in order to get better answers.

Thursday, January 9, 2014

Week1—— Muddiest Points

1. Big data
During the introduction of big data, I was confused about the value of big data. Does it mean the actual value of the data sets, or the meaningful value of using big data?

2. Unstructured information
For unstructured information, how do we extract practical information from images or videos? And what are the differences between this processing in information retrieval and in data mining?

3. Taylor's model
I missed the point about which two stages have a gap between them; does the formalized stage fill the gap?