MEAFA Professional Development Workshop on Social Media data extraction, management and analysis, 27-30 November 2017

Workshop overview

The total amount of information available in the digital sphere is growing rapidly. According to current estimates, from 2013 to 2020 the 'digital universe' will grow by a factor of 10, from 4.4 ZB to 44 ZB (1 Zettabyte is 1,000,000,000,000 GB). Large amounts of data are accumulated at social media sites. There are hundreds of social media sites, created for different purposes; the largest of these have more than 2 billion registered users. Many social media sites provide APIs that allow their data to be collected and analysed.

In this workshop we will review key concepts in data extraction and management, in particular how data can be handled using relational database management systems, and how unstructured data from the WWW and social media can be gathered, parsed, loaded into a database, and analysed. In addition, the workshop will cover the basics of network science and its applications. Special attention will be given to concepts and characteristics that are useful for analysing social, economic and financial networks. For example, various centrality concepts, which have been developed in social network analysis to identify the most influential players in social networks, will be considered in detail. Network-based approaches for extracting useful information from financial data sources will also be presented.

The workshop will use the Python programming language (Anaconda distribution), the Tweepy Python library for accessing the Twitter API, and PostgreSQL (an open-source database management system).

Social Media workshop presenters

The presenters are Alexander Semenov of the Social Media Analysis Group, Department of Computer Science and Information Systems at the University of Jyvaskyla, Finland, and Alexander Veremyev, Department of Industrial Engineering & Management Systems, University of Central Florida, USA. Alexander Semenov is an expert in social media big data monitoring and analysis, including information spread and detection of identity theft, analysis of large social media graphs, evolving networks and blockchain networks, reposting cascades and clustering. Alexander Veremyev is an expert in mathematical modelling and in the development and analysis of efficient methods for solving various combinatorial optimization problems in complex networks, including network interdiction, critical component detection, influence maximization, strategic network formation, efficient network design, learning in networks, consensus and cooperation, clustering and community detection, interdependent multilayer networks and network-based data mining.

Content description

You may attend any one day or any combination of the following days.

Day 1 (Monday 27 Nov 2017): Introduction to Python by Dmytro Matsypura, Business Analytics, The University of Sydney

This day assumes no previous knowledge of Python. Days 2-4 of the workshop assume entry-level knowledge of Python, so if you have no previous experience with Python then you should attend Day 1. The day introduces Python, its limitations and strengths, and its core syntactic features. It discusses key principles and presents notable features, including data types, data structures and looping techniques. The day is of interest to those who are new to Python or have no prior programming experience. It is also useful to those with limited programming experience who wish to attain a more structured understanding of Python.

Days 2-4 (Tuesday-Thursday, 28-30 Nov 2017): Social Media data extraction, management and analysis by Alexander Semenov, University of Jyvaskyla, Finland, and Alexander Veremyev, University of Central Florida, USA.

The 3-day workshop assumes knowledge of Day 1. Day 2 begins with an introduction to Big Data concepts for both structured and unstructured data, and also introduces the SQL language and the open-source PostgreSQL database management system. It reviews key SQL statements for data management. It then introduces key network science concepts, including graph theory fundamentals, and lays down definitions such as network degree, connectivity and path. It concludes with network visualization and the computation of basic network characteristics. Day 3 describes the data that is available on the Internet, its structure, limitations and restrictions, and then demonstrates ways of extracting this data into usable databases. It discusses social media APIs and the JSON data format. Because the extracted data is largely text, the workshop will also discuss regular expressions and the management of text. Day 3 finishes with an application on extracting data from the Twitter API. Day 4 analyses social media data as networks. It demonstrates how to identify the influential persons in a social network, centrality computations and network visualizations. It finishes with an application to mining financial data.

Enrolment and Fees

You may attend Day 1 alone, Days 2-4, or all days of the workshop. The cost for attending Day 1 alone is $600. Days 2-4 are bundled together at $1800. Thus, attending all four days costs $2400. Prices include GST.

If you are paying with a University of Sydney credit card or through an internal journal entry then deduct the cost of GST.

Fees include extensive course material, code, data sets, use of computing facilities, and full catering throughout the days. To express your interest in attending you must complete the online form:

Expression of Interest

Numbers are limited and places are reserved on a first-come first-served basis upon the submission of the online EOI form. Successful attendees will be notified shortly after and invoices will be issued accordingly. Due to limited places, MEAFA maintains a no refund policy. For more information on enrolment and fees contact business.meafa@sydney.edu.au.

Net proceeds from the workshop go to funding MEAFA PhD scholarships.

Discounts

You may qualify for one of the following discounts:

  • 25% discount for a limited number of non-employed full-time research students.
  • 10% discount for additional attendees from the same business organisation, governmental department or academic unit.

Venue and computing facilities

The workshop takes place at The University of Sydney Business School, Codrington Building H69, Computer Lab 1. The H69 Codrington Building is adjacent to the new Business School building. For directions, go to Campus Maps and search for H69.

Desktop PCs are provided onsite. You can also work on your own laptop, but you cannot access the web using the University of Sydney server. If you plan to work on your own laptop, make sure to install Python, Tweepy and PostgreSQL beforehand. No printing facilities are available.

Accommodation

MEAFA does not engage in the administration of temporary accommodation. It is up to you to find suitable living arrangements.

Timetable

All days have the same time schedule:

  • 08:40-09:00 - Welcome tea and coffee
  • 09:00-10:30 - Session 1
  • 10:30-10:45 - Morning break
  • 10:45-12:15 - Session 2
  • 12:15-13:15 - Lunch
  • 13:15-14:45 - Session 3
  • 14:45-15:00 - Afternoon break
  • 15:00-16:30 - Session 4
  • 16:30-17:00 - Buffer time and user-specific questions

The computer labs will be accessible from 8am to 8pm every day.

Detailed Programme

Day 1 (Monday 27 Nov 2017): Introduction to Python

Session 1: Python basics
Python versions; Python distributions; installation; updates; help resources; integrated development environments (IDE).
Session 2: Key features
Data types; type conversions; files and loops; Boolean statements; if statements; lists and list operations; list comprehensions.
Session 3: Text management
Dictionaries; sets and set operations; regular expressions; creating, iterating over, sorting, accessing keys and values; fast extraction of lists from dictionaries.
Session 4: Programming features
Introduction to functions; variable scopes; loops; modules and classes; debugging and error handling.

Day 2 (Tuesday 28 Nov 2017): Big Data and Networks

Session 1: Big Data
Introduction to big data; structured and unstructured big data; database management systems (DBMS); SQL language.
Session 2: PostgreSQL DBMS
SQL statements; creating a table; inserting data into the table; selecting the data from the table; loading external data into table; joining several tables; indexing table columns.
Session 3: Network Science
Networks in various application domains; mathematical representations of networks; graph theory concepts; definitions for network degree, connectivity, path and more.
Session 4: Network analysis
Big data network visualization; computation of basic network characteristics; network metrics.
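
As a rough illustration of the SQL statements listed for Day 2 (creating a table, inserting data, indexing a column and joining tables), here is a minimal sketch. It is not taken from the workshop material. It assumes Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs without a database server, and the table and column names are hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for PostgreSQL.
# Table and column names are hypothetical, chosen only for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Creating tables
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)"
)

# Inserting data into the tables
cur.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(1, "alice"), (2, "bob")],
)
cur.executemany(
    "INSERT INTO posts (id, user_id, body) VALUES (?, ?, ?)",
    [(1, 1, "first post"), (2, 1, "second post"), (3, 2, "hello")],
)

# Indexing a table column that is used in the join below
cur.execute("CREATE INDEX idx_posts_user_id ON posts (user_id)")

# Selecting data by joining the two tables: number of posts per user
cur.execute(
    """
    SELECT u.name, COUNT(p.id)
    FROM users AS u
    JOIN posts AS p ON p.user_id = u.id
    GROUP BY u.name
    ORDER BY u.name
    """
)
post_counts = cur.fetchall()
conn.close()

print(post_counts)
```

The same statements should carry over to PostgreSQL with minor changes, such as a different connection library and parameter placeholder style.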

Day 3 (Wednesday 29 Nov 2017): Social media data

Session 1: Web data
Internet; WWW; data available at web sites; Document Object Model (DOM) trees; XPath; selecting nodes from an XML document.
Session 2: Text data
Text data characteristics; regular expressions; parsing unstructured data; loading parsed data into database; management of text data.
Session 3: Social media data
Social media Application Programming Interfaces (APIs); the JavaScript Object Notation (JSON) format for data interchange.
Session 4: Twitter application
The Twitter API; streaming and REST APIs; connection to Twitter; data collection from Twitter.
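
To give a flavour of the Day 3 topics, the sketch below parses a JSON record and extracts hashtags from its text with a regular expression. It is an illustration only. The payload and its field names are hypothetical rather than an actual Twitter API response, and no connection to Twitter is made, since that would require API credentials.

```python
import json
import re

# Hypothetical JSON record; the field names are assumptions for illustration
# and do not reproduce the real Twitter API schema.
raw = (
    '{"user": {"screen_name": "example_user"}, '
    '"text": "Learning #Python and #SQL at the workshop"}'
)

# Parse the JSON text into nested Python dictionaries
record = json.loads(raw)

# Extract hashtags: a '#' followed by one or more word characters
hashtags = re.findall(r"#(\w+)", record["text"])

print(record["user"]["screen_name"], hashtags)
```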

Day 4 (Thursday 30 Nov 2017): Network analysis

Session 1: Fundamentals of network analysis
Key network measures and their interpretation; identifying the most influential persons in a social network.
Session 2: Measures
Centrality computation; network visualization.
Session 3: Re-posting application
Analysis of viral advertisement re-posting activity in social media.
Session 4: Financial application
Network-based approaches for mining financial data.
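
As a small illustration of the centrality idea covered on Day 4, the following sketch computes degree centrality (the fraction of other nodes each node is connected to) for a toy undirected network in plain Python. The edge list is made up, and degree centrality is only one of several centrality measures; the workshop may use other measures and libraries.

```python
from collections import defaultdict

# Hypothetical undirected friendship network, given as an edge list
edges = [("ann", "bob"), ("ann", "cat"), ("ann", "dan"), ("bob", "cat")]

# Build an adjacency structure: node -> set of neighbours
neighbours = defaultdict(set)
for u, v in edges:
    neighbours[u].add(v)
    neighbours[v].add(u)

# Degree centrality: number of neighbours divided by (n - 1),
# where n is the number of nodes in the network
n = len(neighbours)
degree_centrality = {
    node: len(adj) / (n - 1) for node, adj in neighbours.items()
}

# The node with the highest degree centrality
most_central = max(degree_centrality, key=degree_centrality.get)

print(most_central, degree_centrality)
```

Here "ann" is connected to every other node, so this measure ranks her as the most influential person in the toy network.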

N.B. The precise content per session is subject to reshuffling and fine-tuning.

Expression of Interest

Numbers are limited and places are reserved on a first-come first-served basis following the completion of the online form:

Expression of Interest