TR_*_*K21 5 r emoji topic-modeling data-preprocessing
I need to do topic modeling in R on pieces of text that contain emojis. The replace_emoji() and replace_emoticon() functions let me analyze them, but there is a problem with the results.
A red heart emoji is translated as "red heart ufef". These words are then treated as separate tokens during the analysis, which compromises the results.
Terms like "heart" can have very different meanings, as can be seen with "red heart ufef" and "broken heart".
The function replace_emoji_identifier() doesn't help either, as the identifiers make the analysis hard.
Dummy data set, reproducible using dput() (including the step that forces lowercase):
Emoji_struct <- c(
list(content = " wow", " look at that", "this makes me angry", "?\ufe0f, i love it!"),
list(content = "", " thanks for helping", " oh no, why? ", "careful, challenging ???")
)
Current code (data_orig is a list of several files):
library(textclean)
# The rest should be standard R packages for pre-processing
library(tm) # assumption: stripWhitespace() below is taken to be tm's
#pre-processing:
data <- gsub("'", "", data)
data <- replace_contraction(data)
data <- replace_emoji(data) # replace emoji with words
data <- replace_emoticon(data) # replace emoticon with words
data <- replace_hash(data, replacement = "")
data <- replace_word_elongation(data)
data <- gsub("[[:punct:]]", " ", data) #replace punctuation with space
data <- gsub("[[:cntrl:]]", " ", data)
data <- gsub("[[:digit:]]", "", data) #remove digits
data <- gsub("^[[:space:]]+", "", data) #remove whitespace at beginning of documents
data <- gsub("[[:space:]]+$", "", data) #remove whitespace at end of documents
data <- stripWhitespace(data)
Desired output:
[1] list(content = c("fire fire wow",
"facewithopenmouth look at that",
"facewithsteamfromnose this makes me angry facewithsteamfromnose",
"smilingfacewithhearteyes redheart \ufe0f, i love it!"),
content = c("smilingfacewithhearteyes smilingfacewithhearteyes",
"smilingfacewithsmilingeyes thanks for helping",
"cryingface oh no, why? cryingface",
"careful, challenging crossmark crossmark crossmark"))
Any ideas? Lowercase would work, too. Best regards. Stay safe. Stay healthy.
Answer
Replace the default conversion table in replace_emoji with a version in which the spaces and punctuation are removed:
hash2 <- lexicon::hash_emojis
hash2$y <- gsub("[[:space:]]|[[:punct:]]", "", hash2$y)

replace_emoji(Emoji_struct[,1], emoji_dt = hash2)
Example
Single character string:
replace_emoji("wow! that is cool!", emoji_dt = hash2)
#[1] "wow! facewithopenmouth that is cool!"
Character vector:
replace_emoji(c("1: ", "2: "), emoji_dt = hash2)
#[1] "1: smilingfacewithsmilingeyes "
#[2] "2: smilingfacewithhearteyes "
List:
list("list_element_1: ", "list_element_2: \xe2\x9d\x8c") %>%
  lapply(replace_emoji, emoji_dt = hash2)
#[[1]]
#[1] "list_element_1: fire "
#
#[[2]]
#[1] "list_element_2: crossmark "
Rationale
To convert emojis to text, replace_emoji uses lexicon::hash_emojis as a conversion table (a hash table):
head(lexicon::hash_emojis)
#              x                        y
#1: <e2><86><95>            up-down arrow
#2: <e2><86><99>          down-left arrow
#3: <e2><86><a9> right arrow curving left
#4: <e2><86><aa> left arrow curving right
#5: <e2><8c><9a>                    watch
#6: <e2><8c><9b>           hourglass done
This is an object of class data.table. We can simply modify the y column of this hash table so that all spaces and punctuation are removed. Note that this also lets you add new ASCII byte representations and their accompanying strings.
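To make the idea of editing the table and adding entries more concrete, here is a minimal base-R sketch. It is not textclean's actual implementation: emoji_tbl is a hypothetical stand-in for lexicon::hash_emojis, and replace_emoji_sketch is an illustrative helper that assumes the lookup works by converting the text to its ASCII byte representation and substituting each key with its label.

```r
# Hypothetical two-row stand-in for lexicon::hash_emojis:
# x holds the UTF-8 bytes in "<xx>" notation, y holds the label.
emoji_tbl <- data.frame(
  x = c("<e2><9d><8c>", "<e2><9d><a4>"),
  y = c("cross mark", "red heart"),
  stringsAsFactors = FALSE
)

# Same clean-up as in the answer: drop spaces and punctuation from the
# labels so that each emoji becomes a single token.
emoji_tbl$y <- gsub("[[:space:]]|[[:punct:]]", "", emoji_tbl$y)

# Add a new byte representation with its accompanying string (fire, U+1F525).
emoji_tbl <- rbind(
  emoji_tbl,
  data.frame(x = "<f0><9f><94><a5>", y = "fire", stringsAsFactors = FALSE)
)

# Illustrative helper (an assumption, not the textclean function):
# turn non-ASCII bytes into "<xx>" form, then swap each key for its label.
replace_emoji_sketch <- function(text, tbl) {
  out <- iconv(text, from = "UTF-8", to = "ASCII", sub = "byte")
  for (i in seq_len(nrow(tbl))) {
    out <- gsub(tbl$x[i], paste0(" ", tbl$y[i], " "), out, fixed = TRUE)
  }
  # Collapse the padding spaces introduced around the labels.
  trimws(gsub("[[:space:]]+", " ", out))
}

res_cross <- replace_emoji_sketch("careful \u274c", emoji_tbl)
res_fire  <- replace_emoji_sketch("\U0001F525 wow", emoji_tbl)
```

With the real package, the equivalent step would be to append a row to hash2 before passing it to replace_emoji via emoji_dt. Because the labels no longer contain spaces or hyphens, the later punctuation-stripping step in the question's pipeline cannot split them into separate words.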