Temporal Text Mining

At Siemens Corporate Research I develop decision support systems based on different types of data. The following publications cover some aspects of my work dealing with documents. In keeping with my earlier work, the approaches apply temporal data mining methods to text mining.

Organic pie charts are a novel visualization of complex, high-dimensional data such as documents. They adapt the pie chart paradigm: a circular display whose segments represent groups of similar documents. The rugged outline of the chart, however, emerges from the data. The visualization is created by combining one-dimensional Emergent Self-Organizing Map (ESOM) training with multi-scale time series analysis.
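To make the idea of a circular, one-dimensional map more concrete, here is a minimal sketch of a plain ring-shaped self-organizing map in Python. It is not the ESOM training or the multi-scale analysis from the paper; the function names, the learning-rate and radius schedules, and the toy vectors are all assumptions chosen for illustration.

```python
import math
import random


def ring_distance(i, j, n_units):
    """Distance between two units on a closed ring of n_units."""
    d = abs(i - j)
    return min(d, n_units - d)


def best_matching_unit(weights, x):
    """Index of the unit whose weight vector is closest to x (squared Euclidean)."""
    return min(
        range(len(weights)),
        key=lambda u: sum((w - v) ** 2 for w, v in zip(weights[u], x)),
    )


def train_ring_som(data, n_units=20, epochs=30, seed=0):
    """Train a 1-D SOM with ring topology on a list of equal-length vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        progress = epoch / epochs
        learning_rate = 0.5 * (1.0 - progress) + 0.01  # assumed linear decay
        radius = max(1.0, (n_units / 2) * (1.0 - progress))  # shrinking neighborhood
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            bmu = best_matching_unit(weights, x)
            for u in range(n_units):
                d = ring_distance(u, bmu, n_units)
                influence = math.exp(-(d * d) / (2.0 * radius * radius))
                for k in range(dim):
                    weights[u][k] += learning_rate * influence * (x[k] - weights[u][k])
    return weights


# Hypothetical toy "document vectors": two loose groups in 2-D.
toy_vectors = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
               [0.9, 1.0], [1.0, 0.9], [0.95, 0.95]]
som = train_ring_som(toy_vectors)

# Each vector is placed at the angle of its best-matching unit on the circle.
angles = [360.0 * best_matching_unit(som, x) / len(som) for x in toy_vectors]
```

Because the units form a closed ring, every input gets an angular position on a circle, and similar inputs should tend to land on nearby units. Real document vectors would be much higher dimensional, and an emergent map would typically use far more units than this toy setup.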

The PubMed archive contains more than 18M biomedical research abstracts. Temporal text mining can reveal interesting aspects of this huge collection of documents. We have investigated the prediction of emerging trends that may indicate important new technologies. For this purpose we extracted the trends of about 180k words from 1.5M documents on cancer published from 1975 to 2007. We identified 81 terms that represent biomarkers (genes, proteins, etc.) with a major impact during the period under study. Our method was able to predict the breakthrough of the biomarkers several years in advance with high accuracy.
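As a rough illustration of what a per-word trend can look like, the following sketch computes, for each term, the fraction of documents per year that mention it. The normalization, the whitespace tokenization, the function name `yearly_term_trends`, and the toy corpus are assumptions for this example and not necessarily what was used in the study.

```python
from collections import Counter, defaultdict


def yearly_term_trends(documents):
    """Map each term to {year: fraction of that year's documents containing it}.

    `documents` is an iterable of (year, text) pairs.
    """
    docs_per_year = Counter()
    doc_freq = defaultdict(Counter)  # term -> year -> number of documents
    for year, text in documents:
        docs_per_year[year] += 1
        for term in set(text.lower().split()):  # count each term once per document
            doc_freq[term][year] += 1
    return {
        term: {year: counts[year] / docs_per_year[year]
               for year in sorted(docs_per_year)}
        for term, counts in doc_freq.items()
    }


# Hypothetical toy corpus of (publication year, abstract text).
toy_corpus = [
    (1990, "tumor growth study"),
    (1990, "tumor marker assay"),
    (1991, "marker expression in tumor tissue"),
    (1991, "novel marker predicts outcome"),
]
trends = yearly_term_trends(toy_corpus)
print(trends["marker"])  # {1990: 0.5, 1991: 1.0}
```

Normalizing by the number of documents per year keeps a term from appearing to rise merely because the archive itself grows over time.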

The Geospace & Media Tool combines news with geospatial and statistical information. Siemens Corporate Research (SCR) developed an efficient text clustering engine that utilizes Locality Sensitive Hashing (LSH). The work included a parameter study on large-scale data that analyzed the tradeoff between speed and accuracy of LSH as well as the effects of pruning the feature space and the cluster representations.
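The speed versus accuracy tradeoff of LSH can be sketched with MinHash signatures and banding, one common LSH scheme for text. The publication does not state here which LSH family the engine uses, so the scheme, the function names, and the `num_hashes` and `bands` parameters below are assumptions for illustration only.

```python
import hashlib
from collections import defaultdict


def minhash_signature(tokens, num_hashes=32):
    """MinHash signature of a non-empty token set using salted BLAKE2b hashes."""
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(f"{i}:{t}".encode(), digest_size=8).digest(), "big"
            )
            for t in tokens
        )
        for i in range(num_hashes)
    ]


def lsh_buckets(docs, num_hashes=32, bands=8):
    """Hash each document's signature band by band into buckets."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = set(text.lower().split())
        if not tokens:
            continue  # skip empty documents
        sig = minhash_signature(tokens, num_hashes)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    return buckets


def candidate_pairs(buckets):
    """Document pairs that share at least one bucket."""
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs


# Hypothetical toy news snippets.
toy_news = {
    "a": "oil prices rise sharply",
    "b": "sharply rise oil prices",  # same words as "a", different order
    "c": "local team wins championship final",
}
pairs = candidate_pairs(lsh_buckets(toy_news))
```

Only documents that share a bucket need to be compared in full. With a fixed signature length, more bands (fewer rows per band) produce more candidate pairs and therefore fewer missed neighbors at a higher comparison cost, while fewer bands have the opposite effect.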

Mörchen, F.: Organic pie charts, In Proceedings IEEE International Conference on Data Mining, 30, (2008), pp. 947-952
Mörchen, F., Fradkin, D., Dejori, M., Wachmann, B.: Emerging trend prediction in biomedical literature, In Proceedings American Medical Informatics Association (AMIA) 2008 Annual Symposium, 29, (2008)
Mörchen, F., Dejori, M., Fradkin, D., Etienne, J., Wachmann, B., Bundschus, M.: Anticipating annotations and emerging trends in biomedical literature, In Proceedings Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 26, (2008), pp. 954-962
Mörchen, F., Brinker, K., Neubauer, C.: Any-time clustering of high frequency news streams, In Proceedings Data Mining Case Studies Workshop (DMCS), The Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, USA, 23, (2007)

An organic pie chart for 354 newspaper articles about 'bush' (based on the LA Times dataset from the Cluto collection). Pie segments bursting from the center indicate groups of similar articles. An automatic segmentation with 20 segments is overlaid. Each segment is annotated inside the circle with its most important words (stems) and the majority class (not used for training).

Number of years by which important cancer biomarkers were predicted in advance of their later inclusion in the MeSH vocabulary.