Workshop!
The workshop concluding the first phase of the shared task has just taken place in Hamburg, Germany. Since this was the first time ever that annotation guidelines were evaluated and compared in such a way, the exact flow of the workshop was open and unknown to both the participants and the organizers (read here about the plans and preparations for the workshop). While the full wrap-up of the workshop will take some time, in this post we point out a few things we learned along the way.
It quickly became clear that the participants approached the challenge of narrative levels in very different ways and with different goals. While some participants had a clear interest in narratological theory and conceptual development, others were primarily interested in the annotation of discourse phenomena or in creating data to be used in machine learning. Accordingly, some guidelines cover narrative levels exclusively, while others specify additional annotation categories that are relevant for specific research questions. The conditions under which the guidelines were established also differ: some are the result of seminar or class work, some were written within a funded project, and others were written solely to take part in the shared task. In short, the guidelines were diverse along many different axes.
From the start, it was clear that this was no workshop in which everyone “just” gives a talk about their guidelines and answers a few questions. In fact, there were only a few presentations: some at the very beginning and some by the organizers. Most of the time was spent on plenary discussions and group work, either in the participant groups or in newly shuffled ones. One of the first group work tasks, for instance, was the identification of similarities and differences across the guidelines.
The core evaluation task was the application of a questionnaire covering the three evaluation dimensions we considered important. Each group applied it to all of the other guidelines. All questions were to be answered in text form (to be presented in a plenary session) and, in addition, by assigning points on a four-point Likert scale. The ‘correct’ application of the questionnaire was challenging, without a doubt: some questions required putting oneself into the shoes of the guideline authors, others required some imagination about potential uses of annotations based on the guidelines. Interestingly, however, the ranking established by summing up the points reflected the discursive evaluation very well. Guidelines that were praised for their theoretical basis also achieved high scores in this dimension, and with low variance across the participants. Even if it sounds counter-intuitive at first, a questionnaire with assigned points is an efficient way of generating such results, and it reflects qualitative judgement well (at least in our case).
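To illustrate how such a points-based ranking can be derived, here is a minimal sketch in Python. The guideline names, dimension labels, and scores are made up for illustration and are not the actual workshop data; it simply aggregates 1–4 Likert points per guideline and dimension into mean scores and variances, which is the kind of summary described above.

```python
from statistics import mean, pvariance

# Hypothetical Likert points (1-4) assigned by the evaluating groups,
# keyed by guideline and evaluation dimension. Names and numbers are
# illustrative only.
points = {
    "guideline_A": {"theory": [4, 4, 3], "applicability": [2, 3, 2]},
    "guideline_B": {"theory": [2, 1, 2], "applicability": [4, 3, 4]},
}

for dimension in ("theory", "applicability"):
    # Rank guidelines by their mean score in this dimension, highest first.
    ranked = sorted(points, key=lambda g: mean(points[g][dimension]), reverse=True)
    for g in ranked:
        scores = points[g][dimension]
        # Low variance indicates that the evaluators largely agreed.
        print(f"{dimension}: {g} "
              f"mean={mean(scores):.2f} variance={pvariance(scores):.2f}")
```

A per-dimension mean with its variance captures both parts of the observation above: how highly a guideline was rated, and how much the evaluators agreed on that rating.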
As a final point in this post: we were delighted by the positive, welcoming, and friendly atmosphere. Despite the diverse backgrounds, which usually provide plenty of occasions for misunderstandings, our participants managed to be constructive and productive, even when criticizing each other’s guidelines. It was an intense three days, but it was totally worth it.