Gerhard Hagerer

Dr. rer. nat. M.Sc.



Research Questions

  • Which opinions and topics are expressed in texts on social media such as Facebook, Reddit, Twitter, and Quora? What do the respective distributions look like? How reliable are they?
  • How are social media users and their discussions influenced by the media agenda of online newspapers?
  • How is consumer research carried out on social media, and how can modern text analysis methods support it?
  • How can unreliable annotations be dealt with while training predictive models?

Current Research Interests

  • Market and consumer research on social media comments
  • Unsupervised and weakly supervised methods for aspect extraction and topic modelling
  • Explainability and robustness of short-text classification and clustering
  • Transfer and deep learning methods for natural language processing

Announcements

  • Currently open topics:
    • We started a new research collaboration with an EdTech startup (Jan 2022). We are interested in leveraging NLP and text mining to improve state-of-the-art educational technology, such as autograding, student submission mining, and guided corrections. If you are interested, please read the recent project proposal and send your application, including a transcript of records and CV, to Ivica Pesovski (ivica@brainster.co) with me (Gerhard Johann Hagerer, ghagerer@mytum.de) in Cc.
    • Elisabeth Wittmann from Hochschule Regensburg is looking for master thesis students on Learning Device-Invariant Representations for IR-Spectra. We highly recommend her and her research topic for state-of-the-art deep learning applied to interesting new domains and use cases.

Master Thesis Topics for Summer Term 2020 (already taken)

Explainable and Efficient Multi-Class Multi-Label Text Classification for Industrial Scenarios Using Deep Pre-Trained Embeddings

Motivation

Recent advancements in deep learning and artificial intelligence algorithms have considerably improved the capabilities of automatic text analysis and classification methods. In particular, transfer learning, i.e. the utilization of pre-trained deep neural networks and related techniques, is a milestone in that regard.

However, problems often occur when these machine learning algorithms are applied to real-life scenarios such as those found in industrial settings. A typical issue is having too few data points for too many classes, which is a poor fit for the large number of trainable parameters in artificial neural networks. Furthermore, deep learning models do not provide clear explanations of how and why a classification result is derived.

Thus, we compare different modern text classification approaches with respect to their accuracy, efficiency, and explainability on real-life data from different, preferably industrial, domains. In particular, the efficacy of transfer learning for natural language processing is investigated.

Tasks

Multi-label multi-task classification of texts to predict

  1. categories of products,
  2. legal provisions of paragraphs in contract texts,
  3. opinions in social media comments.

Methods

The following machine learning approaches are compared:

  1. Hierarchical Attention Networks for Document Classification
    • Advantages:
      • Potentially superior in performance
      • Contextualization due to recurrent neural networks
      • Configurable number of hyperparameters
  2. Bag-of-concepts: Comprehending document representation through clustering words in distributed representation
    • Advantages:
      • Explainable AI
      • Transparent implementation
      • Computationally efficient
  3. SS3: A Text Classification Framework for Simple and Effective Early Depression Detection Over Social Media Streams
  4. Baseline: TF-IDF plus support vector machines
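
The baseline in item 4 could be sketched roughly as follows. This is only a minimal illustration, assuming scikit-learn, a linear SVM in a one-vs-rest setup for the multi-label case, and hypothetical toy texts and labels ("taste", "price") that are not part of the thesis data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical toy data: each text may carry several labels at once.
texts = [
    "organic apples taste great",
    "the price of organic milk is too high",
    "great taste but the price is high",
    "cheap milk, low price",
    "apples with a fresh taste",
    "too expensive, the price went up",
]
labels = [["taste"], ["price"], ["taste", "price"], ["price"], ["taste"], ["price"]]

# Turn the label lists into a binary indicator matrix (one column per label).
binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(labels)

# TF-IDF features feed one linear SVM per label (one-vs-rest).
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
model.fit(texts, y)

# Predict an indicator row for a new text and map it back to label names.
predicted = model.predict(["the taste is great"])
predicted_labels = binarizer.inverse_transform(predicted)
print(predicted_labels)
```

One classifier per label keeps the baseline easy to inspect, since the learned weights of each linear SVM can be read per term.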

The following features are utilized:

  1. fastText
    • potentially further byte-pair encodings
  2. BERT

Contributions

  • A comparison of different predictive machine learning techniques utilizing state-of-the-art transfer learning for multi-label multi-class document classification.
  • Consideration of explainability, efficiency, and robustness regarding real-world issues such as data sparsity or noise.

Guided Research Topics for Winter Term 2019/2020

We have at least two offers for students who want to work on a guided research topic. The task is to summarize previous research, i.e., results from student theses and projects conducted under my guidance, as your guided research report. The idea is that I guide you in (learning) to write it as an actual paper to be published at a scientific conference. Besides learning how to articulate and present yourself and your ideas when writing a publication, you get the chance of a second authorship and corresponding citations, both of which are a great contribution to your CV. Moreover, you may have to rerun individual experiments and thereby gain machine learning skills and experience.

The actual topics are all about aspect-based sentiment analysis and various aspects of machine and deep learning. Some currently relevant examples are the following:

  • A Systematic Comparison of Multiple Deep Pre-Trained Cross-Lingual Embeddings for Sentiment Analysis
    • Summarize this thesis
  • On the Applicability of Progressive Neural Networks (PNNs) for Transfer Learning in Natural Language Processing
    • Summarize everything PNN-related given in the reports we hand out to you
  • Tackling Class Imbalance and Small Data for ABSA by Jointly Applying Data Augmentation, Class Balancing, Transfer Learning, and Class Reduction
    • Master thesis of Sumit Dugar: Aspect-Based Sentiment Analysis Using Deep Neural Networks and Transfer Learning

 

Master Thesis Topics for Winter Term 2019/2020


Multi-Lingual Analysis of Media Agenda Setting and its Relation to Social Media Comments Based on Deep Pre-Trained Embeddings


Requirements:
- M.Sc. student at the TUM computer science or mathematics faculty
- interest in deep learning, unsupervised methods, and natural language processing
- motivation to support a concrete publication idea


Action items:
- data: newspaper articles and corresponding comments from Der Spiegel and New York Times about (organic) food
- clustering:
  - proposed algorithm: OPTICS/DBSCAN
  - feature vectors:
    - GloVe word vectors of all the words in the corpus, potentially also using multilingually aligned word vectors for German
    - sentence vectors as obtained by XLING or ABAE
  - main research questions: how many clusters are calculated by OPTICS? how informative are these according
- visualization of the clusters
  - colored down-projection using t-SNE or PCA
  - cluster-wise word lists based on clarity score or other built-in methods; manual labeling of the corresponding tag clouds
- calculate normalized document-wise histograms/distributions of cluster assignments of the respective words or sentences
  - a document is either a news article or all concatenated comments below this article
- calculate a correlation between each histogram bin of all newspaper articles and each histogram bin of all comments. The outcome should be a similarity matrix and a respective heatmap
- further, depict the overall distribution of clusters in English and German articles and comments separately
- apart from your thesis report, additionally write up at least the experiments and results, if not everything, in a two-column paper format
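
The histogram and correlation steps above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: it assumes that cluster assignments per word or sentence are already available (for example from OPTICS or DBSCAN, where the label -1 marks noise), that Pearson correlation is the chosen measure, and that article i is paired with comment document i. The function names and toy assignments are hypothetical.

```python
import numpy as np

def normalized_histogram(cluster_ids, n_clusters):
    """Relative frequency of each cluster within one document."""
    ids = np.asarray(cluster_ids, dtype=int)
    ids = ids[ids >= 0]  # drop noise points (label -1 in OPTICS/DBSCAN)
    counts = np.bincount(ids, minlength=n_clusters).astype(float)
    total = counts.sum()
    return counts / total if total > 0 else counts

def bin_correlation(article_hists, comment_hists):
    """Pearson correlation of every article bin with every comment bin.

    Both inputs have shape (n_documents, n_clusters); row i of each matrix
    belongs to the same article. The result has shape (n_clusters, n_clusters).
    """
    a = article_hists - article_hists.mean(axis=0)
    c = comment_hists - comment_hists.mean(axis=0)
    cov = a.T @ c
    norm = np.outer(np.linalg.norm(a, axis=0), np.linalg.norm(c, axis=0))
    # Bins without any variance get a correlation of 0 instead of NaN.
    return np.divide(cov, norm, out=np.zeros_like(cov), where=norm > 0)

# Hypothetical cluster assignments: one list per article and one list per
# concatenated comment section of the same article.
n_clusters = 3
article_assignments = [[0, 0, 1], [1, 1, 2], [0, 2, 2, -1], [0, 1, 2]]
comment_assignments = [[0, 0, 0, 1], [1, 2, 2], [2, 2, -1], [0, 1, 1, 2]]

article_hists = np.array([normalized_histogram(d, n_clusters) for d in article_assignments])
comment_hists = np.array([normalized_histogram(d, n_clusters) for d in comment_assignments])

similarity = bin_correlation(article_hists, comment_hists)
print(similarity.round(2))
```

The resulting matrix can be handed directly to a plotting routine such as matplotlib's `imshow` to obtain the heatmap.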

Attention-Based Hierarchical Siamese Networks for Multi-Label Opinion Mining Problems on Sparse Data

Predicting Perception Uncertainty in Aspect-Based Sentiment Analysis (ABSA)

Cross-Lingual Aspect-Based Sentiment Analysis Using Semi-Supervised Learning and Deep Pre-Trained Embeddings

Implementing an Opinion Mining Framework for Crawling and Analyzing Social Media Using Unsupervised Semantic Similarity-Based Aspect Extraction

  • implement an integrated framework and web GUI for opinion researchers to automatically perform the following steps:
    • automatically crawl given Facebook/Reddit groups and pages for texts of social media comments
    • define filters based on keywords for relevance
    • cluster the posts in an optimal way using semantic embeddings
      • XLING
      • ABAE
      • LSA
    • provide the option to label the extracted clusters using word lists
    • depict a graph of aspects and their correlation with each other
    • options: a) split up clusters, b) active learning loop to optimize relevance filter or number of clusters
    • produce according topic distributions regarding source, time, amount, and language
    • integrate a given sentiment analysis model
  • The machine learning components are mostly already available and their hyperparameters are already known. The key idea is to apply, visualize, and (by means of human-computer interaction experiments) evaluate deep learning based aspect extraction in terms of descriptive statistics, i.e., to give a semantically coherent overview of what people on social media are talking about by using the most recent state-of-the-art techniques.

 

Ongoing Theses

All of them are already taken. I just leave them here as an overview of my research.


Interesting Reads