Dataset Dictionary
I'm maintaining this reference list of the datasets I've analyzed - please feel free to add to it!
Dataset | Type(s) | Domain(s) |
---|---|---|
AMI | You-Usage Detection | Naturally Occurring Meetings |
coconut_corpus | Tracking Coordination in Task-oriented Discourse | Computer Mediated Dialogues |
argument-facet-similarity | Text Similarity | Online Discussions |
AMI | Text Segmentation | Naturally Occurring Meetings |
spaadia | Syntactic, Semantic, Pragmatic Analysis | Transactional Dialogue Files |
sri_amex_corpus | Syntactic, Semantic, Pragmatic Analysis | Audiotaped Transcriptions of Travel Agent Communications |
swda-master | Syntactic, Semantic, Pragmatic Analysis | Telephone Speech Corpus |
STAC | Surface Acts Classification | Strategic Chat Conversations |
AVDIAR-All | Speech Overlap | Unstructured Informal Meetings |
AVDIAR-All | Speech Activity Detection | Unstructured Informal Meetings |
Artwalk | Referential Communication | Mobile and Skype Conversations |
AMI | Phoneme Recognition | Naturally Occurring Meetings |
dialog_bank | Interoperable Semantic Annotation | Dialog Corpora |
MULTIWOZ2.1 | Intent Classification and Slot Labeling | Human-Human Written Conversations |
candela_data-20200504T045203Z | Function Word Classification | Reddit Threads |
AMI | Disfluency Detection | Naturally Occurring Meetings |
hcrcmaptask.nxtformatv2-1 | Disfluency Detection | Human Communication Recordings |
spaadia | Disfluency Detection | Transactional Dialogue Files |
sri_amex_corpus | Disfluency Detection | Audiotaped Transcriptions of Travel Agent Communications |
dialog_bank | Discourse Relational Analysis | Dialog Corpora |
hcrcmaptask.nxtformatv2-1 | Dialogue Structure Coding (Discourse Analysis) | Human Communication Recordings |
AMI | Dialogue Acts Classification | Naturally Occurring Meetings |
coconut_corpus | Dialogue Acts Classification | Computer Mediated Dialogues |
dialog_bank | Dialogue Acts Classification | Dialog Corpora |
hcrcmaptask.nxtformatv2-1 | Dialogue Acts Classification | Human Communication Recordings |
icsi_mrda+hs_corpus_050512 | Dialogue Acts Classification | Informal Natural Impromptu Meetings |
LEGOv2 | Dialogue Acts Classification | Spoken Dialog Interactions with a Dialog System |
MULTIWOZ2.1 | Dialogue Acts Classification | Human-Human Written Conversations |
spaadia | Dialogue Acts Classification | Transactional Dialogue Files |
sri_amex_corpus | Dialogue Acts Classification | Audiotaped Transcriptions of Travel Agent Communications |
STAC | Dialogue Acts Classification | Strategic Chat Conversations |
swda-master | Dialogue Acts Classification | Telephone Speech Corpus |
textfeats | Dialogue Acts Classification | HCRC MapTask Corpus |
transcripts (1) | Dialogue Acts Classification | HCRC MapTask Corpus |
desiredb | Desire Fulfillment in Natural Language Text | First Person Informal Narratives |
coconut_corpus | Conversational Record Formalization | Computer Mediated Dialogues |
coconut_corpus | Computational Treatment of Discourse Structure | Computer Mediated Dialogues |
AMI | Argumentation Mining | Naturally Occurring Meetings |
internet-argument-corpus | Agreement Frame Detection | Online Debate Forums |
AMI | Adjacency Pair Recognition | Naturally Occurring Meetings |
icsi_mrda+hs_corpus_050512 | Adjacency Pair Recognition | Informal Natural Impromptu Meetings |
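
Since the same dataset appears under several task types, it can help to index the table by task. The snippet below is a minimal sketch of that lookup, assuming the rows are available as pipe-delimited text in the same layout as the table above. The function name `index_by_task` is made up for illustration, and only four rows are copied in to keep it short.

```python
from collections import defaultdict

# Four rows copied from the table above, in the same pipe-delimited layout.
ROWS = """\
spaadia | Disfluency Detection | Transactional Dialogue Files |
sri_amex_corpus | Disfluency Detection | Audiotaped Transcriptions of Travel Agent Communications |
STAC | Dialogue Acts Classification | Strategic Chat Conversations |
swda-master | Dialogue Acts Classification | Telephone Speech Corpus |
"""


def index_by_task(table_text):
    """Map each task type to the datasets listed for it (hypothetical helper)."""
    by_task = defaultdict(list)
    for line in table_text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        cells = [cell for cell in cells if cell]  # drop the empty cell after the trailing pipe
        if len(cells) != 3:
            continue  # skip blank or malformed rows
        dataset, task, _domain = cells
        by_task[task].append(dataset)
    return dict(by_task)


tasks = index_by_task(ROWS)
print(tasks["Disfluency Detection"])
```

The same grouping could be keyed on the domain column instead, to see which tasks are covered for a given kind of conversation.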
AESLC-master
This is a dataset of email subject lines. It could be used for Topic Modeling, but not in the context of dialogues, because the domain is very different. However, given the size of this dataset, pre-training on it might not be a bad idea.
AMI
Argument-Dialogue-Summary
This would be a very useful dataset for Conversational Summarization. It also provides multiple reference summaries per input, so a model can return several different versions and we can rank them. Another thing I find interesting is that each conversation is packed into a single text, which is easier to feed to a model while still letting it tell Speaker-1 from Speaker-2. However, the domain is not speech conversations, where one might say something without thinking - it's online debates.
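As a rough sketch of what "one text with speaker labels" can look like as model input, the snippet below joins a list of turns into a single speaker-tagged string. The `Speaker-1`/`Speaker-2` labels come from the note above; the exact file format of the dataset is not confirmed here, and both the `flatten_dialogue` helper and the example turns are made up for illustration.

```python
def flatten_dialogue(turns):
    """Join (speaker, utterance) pairs into one speaker-tagged string."""
    return " ".join(f"{speaker}: {utterance.strip()}" for speaker, utterance in turns)


# Made-up turns for illustration; not taken from the dataset.
turns = [
    ("Speaker-1", "Cities should ban cars downtown."),
    ("Speaker-2", "That would hurt local shops."),
]

print(flatten_dialogue(turns))
```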
Argument Facet Similarity
This would be a great dataset for defining dialogue similarity. Two arguments are similar if they both have the same context, which can be useful for tracking a topic and its similarity throughout a conversation. It also has a regression label for the degree of similarity. However, the domain is not direct conversation but online discussions, where people have more time to think: the text may have spelling errors, but it has none of the social cues found in speech.
Argument Quality represents how much this argument was similar to others on two scales (2-1) and (3). It could be useful for finding the best argument against which to match other dialogues, which would therefore be the best candidate for extractive summarization.
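To make the extractive idea concrete, here is a minimal sketch that picks the argument most similar, on average, to all the others. It uses plain token-overlap (Jaccard) similarity as a stand-in; this is an assumption for illustration and is not the dataset's Argument Quality score or its similarity label. The helper names and example arguments are made up.

```python
def jaccard(a, b):
    """Token-overlap similarity between two strings, in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def most_central(arguments):
    """Return the argument with the highest mean similarity to all the others."""
    def mean_similarity(i):
        scores = [
            jaccard(arguments[i], arguments[j])
            for j in range(len(arguments))
            if j != i
        ]
        return sum(scores) / len(scores) if scores else 0.0

    best_index = max(range(len(arguments)), key=mean_similarity)
    return arguments[best_index]


# Made-up arguments for illustration only.
arguments = [
    "gun control reduces crime",
    "gun control reduces violent crime in cities",
    "crime rates in cities are falling",
]

print(most_central(arguments))
```

If the dataset's own similarity labels are available, they could replace `jaccard` here without changing the selection logic.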
ArtWalk
ArtWalk captures how conversation sounds when one partner has to walk, talk and navigate everyday obstacles in a small city environment, such as cars and pedestrians. The corpus consists of transcripts of 24 friend and 24 stranger dyads who did this two-round referential communication task. In total, it contains approximately 185,000 words and 23,000 turns, from conversations that ranged from 24 to 55 minutes. It includes referent negotiation, direction-giving and small talk (non-task talk). Additional information includes selected post-experiment questionnaire responses, participant genders, sunset times, weather conditions, photos of the targets and their coordinates.
This might be a good dataset for training coreference resolution on conversations where speakers give and follow instructions. The domain covers both mobile and Skype conversations, so it is definitely interesting.
Arxiv Dataset
This dataset is useful for Extractive Summarization of text data. However, it is probably not useful in the context of dialogue data, unless we are looking to summarize monologues.
Auto-hMDS-master
This dataset is useful for Extractive Summarization of text data and for Topic Modeling. However, it is probably not useful in the context of dialogue data, unless we are looking to summarize monologues. A good thing about it is that it is domain independent (no two texts are from the same domain), so it would be useful in a variety of scenarios.