Flexible data processing, analysis and visualisation

Twitter is one of the primary sources of information for many research projects and journalists when it comes to the analysis of social media content or Breaking News events. In this context, Twitter Analysis can be an entry point for further investigation and research.

Flexible solutions are required to cope with continuously growing real-time data streams. From a journalistic perspective, we want to move away from rigid data dashboards and give journalists greater flexibility in data processing, analysis and visualisation.

Just in time for the UK General Election, the News-Stream 3.0 project implemented a first demonstrator, which enables the monitoring of Twitter reactions to the election debates. The head-to-head race between the parties is shown on a timeline, which compares the number of tweets for Labour and the Conservatives.

Screenshot from the NewsStream Demonstrator showing a comparison of mentions of Cameron vs. Miliband on Twitter
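
The comparison shown in such a timeline comes down to counting matching tweets per unit of time. The following is a minimal Python sketch of that idea, not the demonstrator's actual implementation: the sample tweets, the one-minute bucket size and the function name `mentions_per_minute` are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime

# Made-up sample records (timestamp, text); the demonstrator's real data
# model is not described in the article.
TWEETS = [
    ("2015-04-30T20:00:12", "Cameron opens the debate"),
    ("2015-04-30T20:00:48", "Miliband responds to Cameron"),
    ("2015-04-30T20:01:05", "Strong answer from Miliband"),
    ("2015-04-30T20:01:30", "miliband on the NHS"),
]


def mentions_per_minute(tweets, keywords):
    """Count, per minute, the tweets that mention each keyword (case-insensitive)."""
    counts = {kw: Counter() for kw in keywords}
    for timestamp, text in tweets:
        # Truncate the timestamp to the minute to form the time bucket.
        minute = datetime.fromisoformat(timestamp).strftime("%H:%M")
        lowered = text.lower()
        for kw in keywords:
            if kw.lower() in lowered:
                counts[kw][minute] += 1
    return counts


timeline = mentions_per_minute(TWEETS, ["Cameron", "Miliband"])
for kw, series in timeline.items():
    print(kw, dict(sorted(series.items())))
```

Each keyword yields one series of counts per minute, which is the kind of data a timeline widget can plot as two competing lines.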

A look under the hood
The demonstrator was developed at an early stage of the project, in time for the first milestone, at which News-Stream 3.0 presented its requirements analysis and a rough concept. At this point, the technologies used are more interesting than the actual results of the analysis.

The demonstrator is backed by a mature big data infrastructure: a Hadoop cluster with 16 nodes and a total storage capacity of 100 terabytes, running Cloudera’s open-source distribution. This distribution enables both distributed batch processing and real-time analysis with Apache Spark. For high-performance data delivery, Cloudera integrates the distributed open-source search platform Apache Solr.
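
The article does not show how the demonstrator queries Solr. As a rough sketch of how a search index can deliver per-minute counts for a timeline, the Python snippet below builds a request using Solr's range faceting parameters (`facet.range`, `facet.range.start`, `facet.range.end`, `facet.range.gap`). The host, the collection name `tweets` and the field names `text` and `created_at` are assumptions, not details from the project.

```python
from urllib.parse import urlencode


def timeline_query(term, start, end, gap="+1MINUTE", time_field="created_at"):
    """Build Solr parameters that count matching documents per time bucket.

    The field names ("text", "created_at") are hypothetical.
    """
    return {
        "q": f"text:{term}",         # documents whose text matches the term
        "rows": 0,                   # only the facet counts are needed
        "wt": "json",
        "facet": "true",
        "facet.range": time_field,   # bucket on the timestamp field
        "facet.range.start": start,
        "facet.range.end": end,
        "facet.range.gap": gap,      # bucket width in Solr date math
    }


params = timeline_query("cameron", "2015-04-30T20:00:00Z", "2015-04-30T21:00:00Z")

# Hypothetical host and collection name.
url = "http://localhost:8983/solr/tweets/select?" + urlencode(params)
print(url)
```

Running one such query per search term would give the series needed for a head-to-head comparison, with the counting done inside the search index rather than in the dashboard.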

Banana Dashboards
The dashboard we use originates from a different context: log file analysis. While big data is not yet an issue for many companies, the collaborative analysis of large amounts of log files is now commonplace in IT operations. This is partly due to the interactive dashboard “Kibana”, which was originally developed as a demo application for the open-source search engine Elasticsearch. As with Twitter, time is the most important dimension for log files: what matters, for example, is the number of users or error messages per unit of time. With Kibana, a few clicks are enough to create a new dashboard as a copy of an existing one or to add a widget. The choice ranges from column and pie charts to maps, tag clouds and lists. For users of log analysis tools, flexibility is key: for example, when additional information is logged, it must be possible to access this information and display it in the dashboard. Given the hectic working conditions in IT operations, good usability is also of great importance.

The similarity to the requirements of editors is striking. For us it was therefore natural to use a dashboard like “Kibana” for the Twitter analysis. To enable seamless integration into Cloudera CDH, we used a fork of Kibana named “Banana”, which works with Apache Solr instead of Elasticsearch.

The Twitter analysis is just the beginning for us. The next step will be to connect and integrate a variety of sources in order to examine the usage patterns of editors. Results of text analysis algorithms developed in the project will take the place of the metadata supplied by data providers such as Twitter. The current demonstrator will serve as a toolbox. One task will be exporting widgets, or rather the data sets identified with them, for use in other formats and applications. Further development of the visualisations will also play a role. The dashboard is easy to extend because the chosen solution is based on the open-source library D3.js, which is popular in data journalism.

Author: Neofonie/DW Innovation
Translation: Birgit Gray