Information Retrieval (Introduction to Medical Informatics) (http://www.cpmc.columbia.edu/edu/textbook) LAST REVIEWED: 12 November 1996 narrative text - huge collection of text, want to find the relevant parts. Usually generated by many persons, many departments, .... No one really knows what is there, what is relevant, how to get it. INFORMATION RETRIEVAL (IR) the name is usually applied to narrative text, but could refer to coded data goal is to retrieve information from: information sources like bibliographic DB electronic medical record for clinical care clinical research database eg biomedical literature 20,000 journals published; 17,000 new books per year size doubles every 10-15 years for 300 years 1879 Index Medicus: subject and author listings 1964 MEDLARS = Medical Literature Analysis and Retrieval System 1966 MEDLINE: computerized database of journals each paper categorized by subject subject vocabulary: MeSH (now UMLS) 1975: began to include abstracts Others: CINAHL, PsycINFO, HealthSTAR --Measuring IR 1. recall am I getting the information I want? also known as sensitivity let TP (true positive) = number of items I retrieved correctly recall = TP / (total # items I wanted) 2. precision am I getting extra information I do not want? also known as predictive value positive let FP (false positive) = number of items I retrieved that I did not really want precision = TP / (total # items retrieved) = TP / (TP + FP) (a) must define "items" for each query, eg: bibliographic: number of literature references clinical: number of lab results retrieved research: number of patients found to satisfy criteria (b) example: I want articles on aspirin causing renal failure there are 100 relevant papers in the literature bibliographic search for "aspirin" and "renal failure" I retrieve: 80 papers on aspirin causing renal failure (some are missed due to synonyms: "kidney") 120 papers on other topics (eg, aspirin dosage in renal failure) TP = 80 FP = 120 recall = 80 / 100 = 0.8 precision = 80 / (80+120) = 0.4 the goal is to get the highest recall and precision possible there is a tradeoff between the two broader search higher recall lower precision narrower search lower recall higher precision METHODS FOR NARRATIVE TEXT assume there is a large collection of documents you want to select the ones that are relevant to you "document" (show example) a document contains a passage of narrative text each document counts as one "item" as defined above examples of a documents in a collection: a paper published in the literature an article in a newspaper a radiology report in a medical record a paragraph in a larger document 1. KEYWORD SEARCH manually assign to each doc set of keywords that identify it ("indexing" or "coding") keywords are chosen from a standard vocabulary perform Boolean search on the keywords and, or, not MEDLINE 6 million articles from 3500 journals, hand indexed keyword vocabulary is MeSH, a hierarchy of terms can explode = query by class can add modifiers, roles, relations (diagnosis of) eg, in MEDLINE (using MeSH vocabulary) search for "aspirin" and "kidney-failure" (show example) assessment very efficient to do the query (only code once) but laborious to assign keywords in first place (15 minutes, $4/ art.) must phrase query in terms of vocabulary ie, must find the correct synonym to use vocabulary may have different focus than the searcher coders may not be not consistent (61% if critical, 34% if main/sub heading) as vocabulary expands, old documents are not re-coded 2. FREE TEXT SEARCH look for the occurence of words or phrases anywhere in a document instead of coder figuring out ahead of time, user codes it query techniques => 1. Boolean: and, or, not ("renal insufficiency" or "kidney failure") and ("aspirin" or "ASA") 2. vicinity: two phrases must occur one after the other (Adj), in same sentence (With), in same paragraph (Same), or just in same document (And) "renal" with "insufficiency" 3. stemming: often suffix does not matter (or even prefix) bacteria* = bacteria, bacterial, ... 4. sections: ask about particular sections of a document if it is formatted (title, abstract) if word found in title, probably more pertinent ex: ("renal failure" or "aspirin"). 5. index file: inverted index of words for speed made of list of words with pointers to documents eg, BRS free text search engine can use MEDLINE database, but allows free text searching on abstracts, rather than just keyword search assessment of free-text search can tailor query to the searcher's needs must state synonyms explicitly (usually miss documents) ie, harder than keyword method must state ALL possible synonyms never know if miss one overall: precise but poor recall 3. SCORING ALGORITHMS a. Basic Method create a matrix showing how many times a given word appears in each document ("frequency") drop unhelpful common words like (the, and, or) words are pre-stemmed doc# kidney renal failure insuffic* heart aspirin mg... ------------------------------------------------------- 1 0 6 0 2 0 2 3 2 6 0 0 1 0 4 20 3 0 0 2 0 20 7 20 ... where doc 1 is the relevant example listed below doc 2 is another relevant document doc 3 is about heart failure and aspirin therapy calculate a score algorithm based upon matching words in query to words in the document, weighted by the frequency for that document then rank documents in descending order helps when 100s of documents are retrieved eg, (asprin, kidney, failure) for each word in the query that matches a word in the document, add its frequency to the score doc 2 = 10 doc 3 = 9 doc 1 = 2 b. Vicinity - include word order, sentence # in matrix for vicinity checks "phrase" is a subset of this eg, could reduce score of doc 3 by specifying that "failure" must come after "kidney", so "heart failure" will not be counted c. Discrimination best words are those that appear frequently in desired reports but not in most others words that occur commonly in all reports ("medicine") do not discriminate among documents, but they make up a large part of the score (more important, they increase the variance of the score, which washes out the effect of the better words) low frequency words not that useful, since they do not occur in documents medium frequency words are the most helpful cannot eliminate all common words (continuum of commonness) therefore weight words by discriminating ability rather than by raw frequency weight proportional to frequency in this document weight inversely proportional to frequency in others eg, "mg" (milligram) appears in many documents, and does not add much information; therefore its affect on the overall score should be small variation: inverse document frequency d. Thesaurus rather than using words found in document (thus suffering from the synonym problem), match words to a preferred synonym found in a thesaurus (vocabulary) eg, UMLS, which contains many vocabularies (MeSH, ICD-9, ...) e. Clustering study the collection of documents and create word clusters statistically, rather than using a thesaurus thus synonyms are based on the relevant body of knowledge, not an outside thesaurus find where a word commonly occurs, and then look for other words that occur in the same places renal "insufficiency" renal "failure" "insufficiency" =? "failure" eg, if "kidney" and "failure" are the names of clusters doc# kidney failure heart aspirin mg... --------------------------------------- 1 6 4 0 2 3 2 6 1 0 4 20 3 0 2 20 7 20 ... doc 1 = 12 doc 2 = 11 doc 3 = 9 f. Natural Language Processing (NLP) rather than a matrix of words, use higher level concepts use NLP to parse the syntactic structure, in order to find groups of related words, such as noun phrases (this sort of NLP is relatively easy) eg, want to find "kidney failure" document about heart failure that mentions kidney excretion of a heart medication would match inappropriately the concept "kidney failure" would not have matched have found that syntax alone is not as good as simple frequency analysis, but with semantic information might be okay g. N-grams - co-occurence of words 4. OTHER APPROACHES a. Factor Analysis create a small set of factors that categorize the reports (can be done by hand or automatically) query consists of specifying weights or values for factors b. Neural Network associative network - given query match words to documents but large number of documents, how to train c. Feedback show the searcher the initial results, asking which documents are relevant, then re-do the query automatically taking answer into account existing system hybrid of scoring algorithm + neural network USER SYSTEMS MEDLINE and BRS and others exist for some time but: still need intermediary to make a reasonable search until recently, 85% searches done by intermediary usually the librarian need to know syntax; Boolean searches are an art but then cannot get information in real time eg, drug interaction info at bedside GRATEFUL MED: PC-based program to guide user through form screens to perform a structured search on MEDLINE uses both keyword and free text searches on MEDLINE helps user exploit structure of MeSH (parent-child) feedback: mark relevance of each retrieved article; it suggests a better search assembles search, then dials in to NLM MEDLINE survey 4.5 million searches/year users: .5 MD, .25 PhD .5 academic, .4 clinical not very experienced relatively young purpose: .46 research, .38 clinical per user: 4 search /month .7 using it less than one year satisfaction: .75 satisfied .5 good #, .25 too many, .15 too few articles .7 use MeSH WAIS (Wide Area Information Servers) protocol for coordinating disparate information sources first find the proper source (set of servers) then find relevant part within the source uses heuristic scoring algorithm try it: logon with id "wais" to "quake.think.com" similar ideas gopher: (run "gopher" on "cunixf") anonymous ftp Medline DB with BRS search engine "aspirin" and "kidney failure" BI PaperChase RETRIEVAL EXAMPLE Free Text Area: TI Renovascular effects of nonprescription ibuprofen in elderly hypertensive patients with mild renal impairment. AB To determine the renovascular effects of nonprescription ibuprofen in the maximum labeled over-the-counter (OTC) dosage for 7 days, and to compare these effects with those of two other available OTC analgesics, aspirin and acetaminophen, we evaluated 25 elderly patients with mild thiazide-treated hypertension and mild renal insufficiency. Under double-blind conditions, patients were randomly allocated to one of three treatment groups: ibuprofen 400 mg 3 times/day, aspirin 650 mg 3 times/day, or acetaminophen 650 mg 3 times/day. Blood pressure and indexes of renal function (blood urea nitrogen, creatinine clearance, serum electrolytes) were measured over 7 days in a clinical research center. None of the treatments had a clinically significant effect on blood pressure. Renal function indexes also remained unchanged during all three treatments. We conclude that elderly patients with mild thiazide-treated hypertension and mild renal insufficiency seem not to be at risk of developing additional renal compromise or of having their hypertension control diminished by treatment with these OTC analgesics for 7 days. Author-abstract. Keyword Area: MJ HYPERTENSION-RENAL: complications (co). IBUPROFEN: pharmacology (pd). KIDNEY: drug-effects (de). KIDNEY-FAILURE: complications (co). MN ACETAMINOPHEN: pharmacology (pd). AGED. AGED-80-AND-OVER. ASPIRIN: pharmacology (pd). BLOOD-PRESSURE: drug-effects (de). BODY-WEIGHT. COMPARATIVE-STUDY. DOUBLE-BLIND-METHOD. DRUGS-NON-PRESCRIPTION: pharmacology (pd). FEMALE. HUMAN. HYPERTENSION-RENAL: drug-therapy (dt), physiopathology (pp). KIDNEY-FAILURE: physiopathology (pp). MALE. MIDDLE-AGE. RENAL-CIRCULATION: drug-effects (de). SUPPORT-NON-U-S-GOVT. Where: ASPIRIN (in MJ, MN) = keyword search match aspirin (in TI, AB) = free text search match renal impairment (in TI, AB) = free text search failure due to synonyms related reading: Hersh WR, Greenes RA. Information retrieval in medicine: state of the art. M.D. Computing 1990;7(5):302-11. Wallingford KT, Humphreys BL, Selinger NE, Siegel ER. Bibliographic retrieval: a survey of individual users of Medline. M.D. Computing 1990;7(3):166-71.