Spatialbits: Text Mining INSPIRE Conference Contributions

[This post has been moved to: http://blog.spatialbits.de/post/inspire2014/]

So the INSPIRE Conference 2014 (#inspire_eu2014) starts tomorrow, after two days of intensive workshops.
For me this poses the challenge of deciding which of the parallel sessions to attend. As I have been experimenting with the R framework lately, I decided to use some text mining techniques instead of reading through all the abstracts, to get an idea of hot topics, trends and potentially interesting sessions.

Here are some of my 'results'. More on the methodology below.

To get a first impression, I take a look at terms that appear frequently (15 times or more) in the contributions' titles:
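
A minimal sketch of how such a list of frequent terms can be produced with tm. The three titles are made-up placeholders, not conference data, and the threshold is lowered to 2 so that the toy example returns something.

```r
library(tm)

# Hypothetical stand-in titles; the real input would be the extracted titles.
titles <- c(
  "INSPIRE metadata validation in practice",
  "Metadata harmonisation for INSPIRE data services",
  "Open data and INSPIRE network services"
)

corpus <- VCorpus(VectorSource(titles))
corpus <- tm_map(corpus, content_transformer(tolower))

# Term-document matrix: rows are terms, columns are titles.
tdm <- TermDocumentMatrix(corpus)

# Terms that occur at least twice over all titles
# (the post uses 15 for titles and 150 for abstracts).
frequent <- findFreqTerms(tdm, lowfreq = 2)
print(frequent)
```

Raising `lowfreq` gives a shorter list, which is how the different thresholds for titles and abstracts come about.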

And the same for terms in the abstracts (150 times or more):

Also from the abstracts, a nicer-looking word cloud (terms appearing 100 times or more):

Now I'd like to identify contributions that deal with topics of interest (e.g. "benefits" (2+), "health" (1+) or "metadata" (5+)):

Given the 'contribution ID' (just the number), I can access the full abstract:

http://inspire.ec.europa.eu/events/conferences/inspire_2014/schedule/submissions/<ID>.html
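
The lookup described above can be sketched roughly as follows. The abstracts and their IDs (101 to 103) are invented for illustration, and the helper names `ids_with_term` and `abstract_url` are hypothetical; only the URL pattern is taken from the line above.

```r
library(tm)

# Hypothetical abstracts, named by made-up contribution IDs.
abstracts <- c(
  "101" = "metadata catalogues and metadata quality",
  "102" = "health data and environmental health indicators",
  "103" = "benefits of shared metadata"
)

corpus <- VCorpus(VectorSource(abstracts))

# Document-term matrix: one row per abstract, one column per term.
m <- as.matrix(DocumentTermMatrix(corpus))
rownames(m) <- names(abstracts)

# IDs of the abstracts in which `term` occurs at least `min_count` times.
ids_with_term <- function(m, term, min_count) {
  rownames(m)[m[, term] >= min_count]
}

# Build the abstract URL for a contribution ID.
abstract_url <- function(id) {
  sprintf(
    "http://inspire.ec.europa.eu/events/conferences/inspire_2014/schedule/submissions/%s.html",
    id
  )
}

hits <- ids_with_term(m, "metadata", 2)
print(abstract_url(hits))
```

Converting to a dense matrix is fine for a few hundred abstracts; for much larger corpora it would be better to stay with the sparse matrix.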

Besides that, the tm package offers a lot of functionality to analyse the datasets further. For example, I can identify terms that are correlated with a specific term. For instance, the terms that are correlated (0.5 or higher) with "wfs" (considering all abstracts) are:
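
A small sketch of such a correlation query using tm's `findAssocs`. The four documents are invented placeholders, so the resulting terms say nothing about the real abstracts.

```r
library(tm)

# Hypothetical mini-corpus; the real input would be the cleaned abstracts.
docs <- c(
  "wfs download service for features",
  "wfs download service performance",
  "metadata catalogue search",
  "metadata editor for catalogue"
)

corpus <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corpus)

# Terms whose occurrence across documents correlates with "wfs"
# at 0.5 or above; the result is a list with one entry per queried term.
assoc <- findAssocs(dtm, "wfs", corlimit = 0.5)
print(assoc$wfs)
```

Terms that always occur together with "wfs" (here "download" and "service") get a correlation of 1, while terms from the unrelated documents are filtered out by the limit.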

So a few words on what I did.

Getting ready:

download the abstracts (wget)
remove headlines, HTML, blank lines, line breaks (sed, tr)
extract abstracts and titles (sed)
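
I did these steps on the command line with wget, sed and tr. For completeness, here is a rough R equivalent of the download and clean-up; it is a sketch, not the commands I actually ran. The function names `fetch_page` and `clean_html` and the sample page are hypothetical, and the extraction of titles and abstracts is left out because it depends on the page structure.

```r
# Download one abstract page as a single string (needs network access;
# defined here but not called in this sketch).
fetch_page <- function(id) {
  url <- sprintf(
    "http://inspire.ec.europa.eu/events/conferences/inspire_2014/schedule/submissions/%s.html",
    id
  )
  paste(readLines(url, warn = FALSE), collapse = "\n")
}

# Remove tags, then collapse line breaks and repeated whitespace.
clean_html <- function(html) {
  text <- gsub("<[^>]+>", " ", html)       # drop HTML tags
  text <- gsub("[[:space:]]+", " ", text)  # line breaks and blank runs become one space
  trimws(text)
}

# Made-up page content to show the effect of the clean-up.
page <- "<html><body>\n<h1>A title</h1>\n\n<p>First line\nsecond line.</p>\n</body></html>"
print(clean_html(page))
```

A regular expression is a blunt tool for HTML and will trip over scripts or comments; for anything beyond simple pages a proper HTML parser would be the safer choice.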

r-project/tm

convert all characters to lower-case
remove numbers, punctuation, whitespace
remove URLs
remove stopwords
apply word stemming
apply stem completion
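
The steps above can be sketched with tm roughly as follows. The two documents are made up, the helper `remove_urls` is hypothetical, and the way stem completion is applied (on the stems of a single document, with the unstemmed corpus as dictionary) is just one possible approach. Note that in this sketch URLs are removed before punctuation, because stripping punctuation first would break them into fragments that no longer match the pattern.

```r
library(tm)  # stemDocument additionally needs the SnowballC package

# Hypothetical documents standing in for the abstracts.
docs <- c(
  "Visit http://example.org for 3 INSPIRE Services!",
  "The services are harmonised; the service is validated."
)
corpus <- VCorpus(VectorSource(docs))

# Hypothetical helper: drop everything from "http" up to the next whitespace.
remove_urls <- content_transformer(function(x) gsub("http[^[:space:]]*", "", x))

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, remove_urls)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Keep the unstemmed corpus as dictionary for stem completion.
dictionary <- corpus
stemmed <- tm_map(corpus, stemDocument)

# Complete the stems of the first document back to dictionary words.
stems <- unlist(strsplit(trimws(content(stemmed[[1]])), " "))
completed <- stemCompletion(stems, dictionary = dictionary)

print(content(corpus[[1]]))   # cleaned text before stemming
print(content(stemmed[[1]]))  # stemmed text
print(completed)              # stems mapped back to full words
```

Stemming merges variants such as "service" and "services" into one term for counting, and stem completion only makes the resulting terms readable again in plots and tables.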

Now the set of documents is ready to run the analyses.

Tuesday, 17 June 2014