License-Aware Web Crawling | Natural Language Processing Group

Training large generative AI models needs enormous amounts of data. Current models often do not respect copyrights. In order to train models that respect copyrights, we need huge data sets with texts and images that use open licenses. The goal of this project is to use Natural Language Processing in order to detect licenses on websites. Since different elements on a website can be licensed differently (e.g. a Creative Commons image in a proprietary text), one of the main challenges is to detect to which parts of a website a mentioned license applies to.