In the following we publish the speech manuscript for a lecture at the Academy of Sciences on June 8, 2020, in which Daniel Spitzer participated as a representative of 100W. The topic of the lecture was the ethically responsible use of artificial intelligence. The members of the committee come from a wide variety of fields, including Prof. Dr. Jessica Burgner-Kahrs (law), Prof. Dr. Olaf Dössel (medicine), Prof. Dr. Anja Feldmann (computer science), Prof. Dr. Carl Friedrich Gethmann (philosophy) and others. The spokesperson of the board is Prof. Dr. Christoph Markschies.
What kind of AI do you work with or develop?
We have very specific requirements for AI. The goal is to improve our analysis and our interpretation. AI can help to solve specific problems that arise when evaluating language. The first of these is the ambiguity of terms.
But let's take a step back. Basically, our text analysis is a rule-based procedure. It relies on stored dictionaries, each of which collects the words that belong to a latent category such as "relationship motive". The idea of measuring personality in this way goes back to the trait-theoretical approach in psychological diagnostics, which postulates that personality traits can be captured by descriptive terms. A frequently cited example is the trait "sociability". Terms like "party", "going out", "dancing", "visiting friends" would be descriptive terms for this trait. Probably the most famous psychological traits, the Big Five, were also developed from the trait-theoretical approach. Other categories developed for language-analytical procedures are, for example, the motives according to McClelland or the categories of Regulatory Focus according to Higgins.
The dictionaries on which the analysis is based are created according to a specific procedure. This procedure is documented in our manual and freely available on our website.
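The dictionary lookup described above can be illustrated with a minimal sketch. The category names, word lists and the function `score_text` are hypothetical and chosen for illustration only; the actual dictionaries and scoring rules are those documented in the manual.

```python
import re

# Hypothetical toy dictionaries (assumption); real dictionaries are far larger.
DICTIONARIES = {
    "sociability": {"party", "dancing", "friends"},
    "achievement": {"goal", "success", "improve"},
}

def score_text(text, dictionaries=DICTIONARIES):
    """Return, per category, the share of matching tokens and the matched words."""
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    total = len(tokens) or 1  # avoid division by zero for empty input
    result = {}
    for category, words in dictionaries.items():
        hits = [t for t in tokens if t in words]
        result[category] = {"share": len(hits) / total, "matched": hits}
    return result
```

Because the matched words are returned alongside the score, a result can always be traced back to the terms that produced it.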
But why do we need artificial intelligence? Artificial intelligence helps us, for example, to decide what meaning a term has. The term "going out", which was used above to describe the trait "sociability", is ambiguous. Depending on the context, it can have different meanings. In a social context, it expresses conviviality. In an economic context, it can mean that something is coming to an end. A rule-based approach fails here, because it cannot decide what the word means in different contexts. AI offers a solution: with so-called language models, the meaning can be inferred from the context.
We use the language model BERT (Bidirectional Encoder Representations from Transformers), which was introduced by Google AI Language researchers in 2018 and trained for the German language by deepset.ai.
How is the system programmed and trained? Is it supervised or unsupervised learning?
There are two steps to be distinguished: first, the setting up and training of BERT, which was done before we used it at 100W, and second, the adjustments we made to it for our purposes. After all, BERT was not developed for the purpose of disambiguation, but in our opinion it provides the most suitable method for that purpose. Our task was to prepare BERT so that it could be used for the disambiguation task.
As already mentioned, BERT was presented in 2018 by researchers from Google AI Language. BERT belongs to the class of transformers (which are most comparable to RNNs [recurrent neural networks]) and is trained with masked token prediction. The training for the German language is provided by deepset.ai. As training data they used texts from Wikipedia, news and openlegaldata, which contains over 100,000 judgements of German courts. In total they used over 12 GB of text corpora for the training. This training was done unsupervised. It is not about optimizing an outcome but about "learning" the structure of the language. This forms the basis. For the purpose of disambiguation, however, further steps are necessary. The procedure I describe in the following is a development of 100 words.
BERT converts words into vectors. For each word, the resulting vector depends on the context, so the same word (e.g. "out") can yield different vectors. Now we only have to decide which meaning and, depending on it, which context is the right one for us. So in what context does the word "going out" have the meaning of conviviality? To decide this, we first need some example sentences that carry the desired meaning of the word. For this purpose we have developed a tool that selects sentences from a large corpus in such a way that they cover the space of the occurring vectors as well as possible. These sentences are then displayed, and a human decides for each one whether it contains the desired meaning. As a result we get sentences which all carry the desired meaning of "going out". The resulting vectors serve as the reference standard. As internal benchmark tests have shown, 15-20 sentences are already sufficient to form such a standard. We always have several vectors of sentences in which the word has the "right" meaning and several vectors of sentences with word senses that do not match. For a new sentence that we want to classify, we check which label the nearest vector has. It doesn't matter how close that vector is, as long as it is closer than all other vectors.
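The nearest-vector decision described above can be sketched as follows. This is an illustration under assumptions, not the 100 words implementation: the three-dimensional toy vectors stand in for real BERT vectors, the labels are invented, and cosine distance is an assumed choice of metric, since the text does not name one.

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity; smaller means the vectors point in a more similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def classify(vector, labelled):
    """Return the label of the single nearest labelled vector."""
    nearest = min(labelled, key=lambda item: cosine_distance(vector, item[0]))
    return nearest[1]

# Hypothetical reference vectors for the word "going out" (assumption).
REFERENCE = [
    ([0.9, 0.1, 0.0], "conviviality"),
    ([0.8, 0.3, 0.1], "conviviality"),
    ([0.1, 0.9, 0.2], "other sense"),
    ([0.0, 0.8, 0.4], "other sense"),
]
```

Calling `classify([0.7, 0.2, 0.1], REFERENCE)` returns the label of whichever reference vector is closest; the absolute distance plays no role, as in the procedure described.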
Where does the data come from?
The data for generating the sentences are partly freely accessible (e.g. speeches of the European Parliament, speeches of the board of directors, job advertisements) and partly our own collections (e.g. Twitter). We have tried to achieve the broadest possible coverage of language events in the collection.
For the specific use cases of our analysis we need data to be able to make comparisons. We often collect this data ourselves, because here we have very specific requirements for the context and the use case.
Why do you develop/use AI? What is the benefit/added value of using AI?
I have already described a benefit of AI. Furthermore we use the possibilities of the language model BERT for our so-called recommendations and insertions.
One of BERT's particular strengths is the insertion of words into a sentence. To do this, you mask a word in a sentence and let the language model find the appropriate insertion. BERT achieves very good results here. We make use of this function in our so-called Augmented Writer as a formulation aid. Imagine a text that has little emotional impact. The aim is to make this text more emotional. Now we can use BERT to suggest emotionally connoted words at appropriate places. This quickly improves the linguistic effect of the text. However, BERT does not decide which words are emotionally connoted. Instead, we check each potential replacement against the corresponding 100 words dictionary. If the word is found there, it is suggested as an alternative formulation in the Augmented Writer. To use an image: we put the AI on a leash by using its possibilities for our purposes and thereby restricting it to special applications.
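The "leash" can be illustrated with a small sketch: the language model proposes fill-in candidates, and only those found in the relevant dictionary are passed on. The candidate list below is a hand-written stand-in for model output, and the word list and the function `suggest` are hypothetical.

```python
# Hypothetical dictionary of emotionally connoted words (assumption).
EMOTION_WORDS = {"delighted", "thrilled", "proud"}

def suggest(candidates, dictionary, limit=3):
    """Keep only candidates that appear in the dictionary, preserving the model's ranking."""
    return [word for word in candidates if word.lower() in dictionary][:limit]

# Stand-in for ranked fill-in candidates returned by a masked language model,
# e.g. for the sentence "We are [MASK] to present our new product."
candidates = ["happy", "delighted", "able", "proud", "here"]
suggestions = suggest(candidates, EMOTION_WORDS)
```

In this design the model only ranks candidates; the dictionary alone decides which of them count as emotionally connoted.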
Fairness and justice?
A big disadvantage of learning procedures is the lack of traceability of a result, especially in the diagnostic area. A justified demand of the HR Ethics Council is the possibility to be able to understand how a result comes about when it comes to a diagnostic assessment. We make the same assessment in the area of implicit motives. Because there we use our rule-based procedure and use artificial intelligence only to correctly assess ambiguous words, the result can be understood. For this purpose, at the end of a test, we display the words that are responsible for achieving a result. We would not be able to achieve this if we relied solely on artificial intelligence.
Is the objective of the AI system clearly defined?
The goal of the AI system is to support our rule-based analysis. It supports us in the disambiguation of word meanings and in the enrichment of texts with new terms. For exploratory purposes, for example to compare the predictive power of our rule-based approach with trained procedures, we also conduct training runs ourselves. So far, however, there is no market-ready application that bases the prediction of a psychological feature on the judgement of an AI. We are nevertheless working on increasing the use of AI in prediction. AI is not an end in itself; we only use it if a significant improvement can be achieved, and the traceability of the results must always be maintained. How the two can be combined is currently a development topic at 100 words.
What ethical considerations follow goal definition and development?
Industry standards or regulatory incentives?
The goals we always pursue in the development of our tools are the criteria of high test quality used in psychological diagnostics. This applies first and foremost to our rule-based approach. High test quality means that our analysis is objective, i.e. independent of the person evaluating it; reliable, i.e. it measures consistently; and valid, i.e. it actually measures what it claims to measure. In order to prove this, we refer on the one hand to work that has already been carried out by other scientists. The criteria for including a finding in our system are described in our manual. But of course, simply adopting the findings of others is not enough to prove the test quality of our procedure. Therefore we also carry out our own studies. In this context, proof of validity and reliability is of particular interest to us.
However, validation studies are subject to high requirements. The proof of validity requires a certain number of cases. In addition, it requires a criterion that is generally accepted. In addition, the data set should have been created independently of the company, as otherwise conflicts of interest could not be ruled out. Because of this difficulty, we cannot validate at will, but must conduct intensive research to locate such data sets. Despite these difficulties, we have succeeded in validating our motive measurement with now three large independent data sets. The results of this validation have been published on our website and are freely accessible there.
We proceed similarly for the use of AI. Before we use an AI system such as BERT, a detailed examination is necessary. This consists of so-called performance tests. Such tests have either already been carried out and documented by other scientists (and we only use the results) or we carry out such tests ourselves. On the basis of such testing, two of our employees wrote a scientific paper which they submitted to a large NLP conference.
Does the system have technical capabilities "by design" to ensure the traceability of decisions?
Yes, see above. The aspect of traceability of results is part of the DNA of our company and will continue to be a central requirement for the development of new applications in the future. We also see traceability as an important competitive advantage. After all, it is also about trust. Customers are much more likely to trust results if they can trace the path they took to achieve them.
The development topic "AI in prediction" mentioned above has been accompanied by the question of how we can make AI explainable since the first development steps. First drafts are already available.
There are detailed instructions for our rule-based procedure, which describe both the psychological background and the technical handling in detail. Furthermore, we embed the results of our motive analysis in a detailed report of the results. Users are therefore not left alone. We also offer introductory web sessions before all of our solutions go live.
How do you ensure that the system does not contain any "bias" in terms of both algorithmic design and data selection? How do you deal with reality reduction and unfair distortions in data sets?
Of course, the language model BERT performs a feature selection that we cannot influence. These features are grouped into layers, which we can also interpret in terms of content. From internal work we know which layer is most important.
How do you ensure fairness, diversity and inclusiveness?
By ensuring that our process is objective and reliable. Objectivity ensures that all participants are treated equally. Reliability ensures that all participants are treated equally, regardless of when an analysis is conducted.
By the way, with one of our applications we promote equality between women and men. We examine the proportion of masculine and feminine words in a text and always recommend (regardless of the application) that these words be used in a balanced proportion. This is so far unique for the German language.
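The proportion check described above can be sketched in a few lines. The English word lists below are invented for illustration (the actual application targets German and uses its own dictionaries), and `gender_balance` is a hypothetical helper.

```python
import re

# Hypothetical word lists (assumption); not the lists used in the product.
MASCULINE = {"competitive", "assertive", "dominant"}
FEMININE = {"supportive", "cooperative", "empathetic"}

def gender_balance(text):
    """Count masculine- and feminine-coded words and report the masculine share."""
    tokens = re.findall(r"[a-z]+", text.lower())
    masculine = sum(t in MASCULINE for t in tokens)
    feminine = sum(t in FEMININE for t in tokens)
    coded = masculine + feminine
    share = masculine / coded if coded else None  # None if no coded words occur
    return {"masculine": masculine, "feminine": feminine, "masculine_share": share}
```

A share near 0.5 would indicate a balanced text; values close to 0 or 1 would prompt a recommendation to rebalance the wording.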
Are you testing the models for specific population groups and problematic cases?
Yes, as long as data sets contain information about different population groups. Unfortunately this is not always the case. But for our motive analysis we were able to generate a subdataset containing non-native speakers. With the help of this data set we were able to adjust the underlying statistics.
When and how is the AI system checked? During commissioning or also during operation?
Both before commissioning and during operation. First, a scientific test of an (AI) system is carried out. Only if the results are promising is the costly development of a product worthwhile. Once a prototype has been developed, it is tested on our test servers. We carry out these tests in-house. External testing would be very expensive, and it would probably also be difficult to find the necessary expertise to evaluate a development outside the company. Once a system has passed the internal test, it is passed on to our customers.
Whether it meets the objectives or whether the objectives meet higher ethical criteria (for example fairness)?
-> Test quality criteria
Who checks the AI system? The developers themselves or other/independent bodies?
AI systems are often open-source projects that are developed and tested by many people. For example, BERT was developed by a Google research group and has been discussed extensively by the scientific community. So BERT is not a backroom development. The customizations that have been made specifically for the needs of our text analysis are an in-house development and were therefore initially only reviewed internally. However, with our conference submission we are striving for transparency in this area as well.
Of course, it is also the case that AI is not tested as extensively as decades-old standard procedures. The novelty is a disadvantage here.
Would you wish for an independent examination and if so, how would such an examination have to be carried out?
We attach great importance to cooperating with research institutions. This is demonstrated by our many contacts and joint projects with universities and colleges. This cooperation often serves to review one aspect of our rule-based approach. Most recently, as part of our cooperation with the University of Giessen, an experiment was set up to test the relationship between regulatory focus and language.
Are there clear responsibilities within the company for the audit of the AI systems?
Yes, the R&D team is responsible for developing our analysis. Of course we cannot take responsibility for BERT. We consider the scope of the decision that BERT makes in our application, namely the decision on the meaning of a term and the decision on the insertion or replacement of a term by another, to be manageable.
At no point do we make recommendations that decide on individuals. Even in our motive analysis, which is most likely to serve a diagnostic purpose, we do not make a decision (in the manner of "Person A is suitable, Person B is not"), but only show results. We explicitly state that a hiring decision cannot be made by the program.
When do we take countermeasures in case of errors and by whom?
Ideally, we discover errors before the program is put into operation. Nevertheless, it is possible that faulty code may creep into the production solutions. We carry out standard spot checks on our production solutions and keep an eye out for anomalies. Since we usually sell our solutions with a three-month pilot phase, there is also the possibility of testing by the customer. During this phase, customers have the opportunity to ask employees of 100 words questions in a video call and/or report errors.
Are there contingency plans in case of errors?
Yes, the so-called hotfix.
What rights do the (end) users have? Are the (end) users informed that they are interacting with an AI system?
Yes, we inform in our manual about how our analysis works. In the motive analysis we also explicitly mention the use of our analysis. The fact that we use BERT (the actual AI in our system) is only mentioned in the manual, but not in the applications.
Do (end) users have the possibility to get information about how AI decisions are made?
Yes, as already mentioned, we make transparent how our results come about. In addition, our manual and instructions explain how our analysis works and what its psychological background is. In the Motive Report, we also offer a detailed presentation of the background and a precise documentation of how our results came about.
Do (end) users have the possibility to complain about AI decisions?
Yes, the weekly video calls during the pilot phase serve this purpose. In addition, you can also report problems with our analysis to us at any time.
How do you make sure that people have the skills to check the outcome/failure of AI systems?
For pure AI systems such verifiability cannot be guaranteed. This is because AI systems would have to be traceable, which they are not.
The situation is different for rule-based procedures. However, here too, it cannot be assumed that users will understand the procedure in depth.
Users of our solutions are usually experts in their field and therefore have a certain expertise. They usually have a very good feeling for how realistic an achieved result is. We actively encourage customers to look at the results of our analysis and, if necessary, to compare them with other, similar procedures. We do not shy away from discussing possible discrepancies; on the contrary, we like to take up such difficulties because they help us in the development of our tool.
How can developers or users maintain or obtain the necessary competence about the functioning and effects of AI systems?
On the one hand, the above-mentioned documents serve this purpose. On the other hand, web sessions help to develop an understanding for our analysis. We release our analysis for research purposes.
Do we lose the competence to adequately assess AI-based decisions due to the excessive dependence on AI systems?
It is naturally very difficult to assess decisions that are made solely by AI-based procedures. The complexity of AI can only be understood by experts in this field. But even for them it is not possible to indicate how a result was achieved. Therefore, I personally - and here I am speaking for 100 words - advise against blind trust in AI systems. Two things are particularly important to me. First: AI is not an end in itself but should be used as specifically as possible, because this at least explains why artificial intelligence is used. In addition, by using AI as specifically as possible, the outcome of the AI becomes clearer and thus more assessable. My second point: Before using AI, the AI system needs to be thoroughly tested. This provides information about the behaviour of the AI and thus makes it more predictable. In order for AI to be useful, it must prove to be useful in this testing. The introduction of target values is therefore essential.
In general, can people take responsibility for AI-based decisions? Why yes or no? What are the arguments in favour, what are the arguments against?
In my view, responsibility can only be taken for the way an AI is constructed. It is necessary to do everything possible to design the conditions in such a way that the result is as ethically sound as possible. This includes the responsible development of AI code, the careful selection and review of training data sets and extensive testing. And, of course, it also includes responsible use. Especially important: Whenever possible, decisions should be left to people and not be made by the machine. The development of and adherence to standards is desirable, but also difficult given the rapid development of AI.
On the other hand, it is difficult to take responsibility for the decision an AI makes. Because responsibility also implies the ability to be accountable. But how should this be possible with black-box AI? The solution we found at 100 words is to combine rule-based procedures with AI systems.
Who can make an overriding decision not to use an AI system or to switch it off?
Switching off the AI component in our analysis is possible in principle, but not as an easily selectable option for customers. If there are concerns about the AI, our analysis can be used completely without this component. However, the results will then be worse, because ambiguous words, for example, will no longer be recognized.