Data Review (Clinical and Research) (Introduction to Medical Informatics) (http://www.cpmc.columbia.edu/edu/textbook) LAST REVIEWED: 1 October 1997 CLINICAL RESEARCH NEEDS clinical care (treat patients) would usually be: get me result of patient's blood tests "intra-patient query" pt1: K, ... ==> pt2: K, K, Na, ... pt3: Na, ... ... clinical research "cross-patient queries" K: pt1, pt2, ... ==> Na: pt2, pt3, ... ... uses of cross-patient queries (actually can be used for care and research) find patient whose name you forgot look at patients similar to your own patient (prognosis) calculate likelihood of disease given findings clinical research screening find cases for teaching rounds hospital administration (how is length of stay) for example: want to assess the association in male patients between hypokalemia (low potassium) and death from myocardial infarction (heart attack) need to put several simpler queries together assume we have a database of coded clinical data; how would we approach answering the above query? find patients with A) sex=male 6840 B) myocardial infarction 717 B) (may need a superclass) C) hypokalemia (absent) C) potassium<3.5 mg/dl 1976 D) died after 1/1/85 579 E) A, B, C, and D == male, MI, lowK, died 16 F) A, B, and C == male, MI, lowK 82 G) A and B == male, MI 382 H) G not C == male, MI, never lowK 308 I) H and D == male, MI, never lowK, died 28 therefore 16/82 (20%) died with lowK vs. 28/308 (9%) died without lowK conclusion: hypokalemia is associated with death from MI not causal QUERYING CODED DATA query = retrieve data from the database, subject to constraints (storage of data has analogous problems) there is a problem for novice users want to make up queries on the fly (ad hoc) do not know what things are called (vocabulary) do not know where or how things are stored (schema) do not know how to phrase a question do not know how to write programs do not know how to ask the computer for data MENU-DRIVEN SYSTEMS problem to solve: user does not know what is available or how to say it vocabulary is too big does not provide procedural (or syntactic) support set of menus and selections to guide you through the available choices provide structure; limit choices at any one point (eg, CIS, library menu) for intra-patient queries can use menu and then browse full list (CIS) for cross-patient queries too many types of data to browse a large list (paper) therefore, follow menu with vocabulary searches more constrained set of terms can use alphabetical, browsing, ... (context dictates the structure of the interface) examples from paper disadvantage: limited to what the menus provide to repeat a query, need to go back through menus QUERY LANGUAGE / TOOLS allow user to type in a question directly 1. Structured Query Language (SQL) SQL is a prime example for relational databases SELECT FROM WHERE SELECT POTASSIUM FROM ELECTROLYTES WHERE PATIENT=123 AND DATE > "1993-01-01" good relatively declarative (say what rather than how or where) common standard among databases (learn it once) relatively understandable can create ad hoc queries can save queries and repeat later on bad still need to know what things are called and how things are stored (vocabulary, table names, column names, necessary joins) specific to relational databases 2. Medical Query Languages (MQL) example of a language written especially for medicine based upon COSTAR outpatient clinical database eg, WHEN CODE1 IS BLOOD PRESSURE AND DATE1 IS AFTER TODAY - 1 YEAR WHEN CODE2 IS BLOOD PRESSURE AND DATE2 FOLLOWS DATE1 LIST NAME, UNIT NUMBER (what is missing from this query) infers reasonable relations among attributes (ie, adds schema knowledge) CODE1 and DATE1, NAME and CODE1 handles time IS AFTER, 1 YEAR good less schema knowledge needed (only vocabulary) better handling of time bad specific to COSTAR databases still requires knowledge of Boolean logic 3. Query By Example (QBE) create a picture of a relational table graphical query language fill constraints into the table indicate which columns you want back good need not know specific syntax bad (like SQL) still need to know vocabulary, tables, columns specific to relational databases 4. Natural Language Question Answering System allow natural language queries not need to know rigid syntax produce coded queries (SQL) for the database possible because domain is limited, and because know database and schema even with a theoretically perfect system, it is still difficult to get data eg, need for filtering retrieval for potassium first one is too old to be relevant second must be an error, since out of range third one is an invalid test time value 1981-01-03 2.7 1993-09-09 50.1 1993-09-10 hemolyzed RESEARCH SYSTEMS clinical research advance medical science by adding to medical KB study a disease and its manifestations uncover relationships among diseases, findings, ... estimate prognosis assess effectiveness of therapy experiments limited by ethical and practical concerns often extend over many years (eg, 30 years) randomized controlled clinical trial compare intervention to control prospective avoid bias (randomize, blinding) retrospective studies look at data already collected for another purpose subject to bias functions of a research system efficient storage of data, with DB utilities make it easy for researcher to query data maintain data quality on input correctness = entry errors completeness = missing data consistency = same finding -> same code statistical support screening for candidates to enter study (must be linked to clinical database) general database packages cheap and readily available on all computers easy to use lack sophisticated statistical processing lack medicine-specific DB support (eg, time, studies) eg, Revelation, dBase, ... general statistical packages cheap and readily available on all computers easy to use use only simple files to store data inflexible for new statistical techniques descriptive statistics eg, mean, std deviation, range summarize quantity of data in a number overall properties, look for associations graphical display eg, scatter plot summarize quantity of data in a graph look for associations and biases analytic statistics eg, regression test specific hypotheses, complex descriptors eg, BMDP, SPSS, SAS medical research systems more expensive to build or buy specific to medical research longitudinal studies (data over long time) time as a special part of syntax Medlog - interactive, time-oriented database used to store and analyze longitudinal clinical data ClinQuery - real-time cross-patient access to Beth- Israel's clinical database (well used) current research 1. large quantity of data (no human can process it) 2. sharing data among institutions automated abstraction turn raw data into relevant high level concepts data -> syndromes -> diseases automated discovery of medical relationships time-lagged associations in DB may imply causality generate hypotheses test with retrospective experiments on DB common schema, vocabulary, query language related reading: Safran C, Porter D, Rury CD, Herrimann FR, Lightfoot J, Underhill LH, et. al. ClinQuery: searching a large clinical database. M.D. Computing 1990;7(3):144-53.