Casting topic modeling results to a dataframe

xav*_*avi 5 nlp python-3.x pandas topic-modeling bert-language-model

I have used BERTopic with KeyBERT to extract some topics from some docs:

from bertopic import BERTopic
topic_model = BERTopic(nr_topics="auto", verbose=True, n_gram_range=(1, 4), calculate_probabilities=True, embedding_model='paraphrase-MiniLM-L3-v2', min_topic_size=3)
topics, probs = topic_model.fit_transform(docs)

Now I can access the topic names:

freq = topic_model.get_topic_info()
print("Number of topics: {}".format(len(freq)))
freq.head(30)

   Topic    Count   Name
0   -1       1     -1_default_greenbone_gmp_manager
1    0      14      0_http_tls_ssl tls_ssl
2    1      8       1_jboss_console_web_application

and inspect the topics:

[('http', 0.0855701486234524),          
 ('tls', 0.061977919455444744),
 ('ssl tls', 0.061977919455444744),
 ('ssl', 0.061977919455444744),
 ('tcp', 0.04551718585531556),
 ('number', 0.04551718585531556)]

[('jboss', 0.14014705432060262),
 ('console', 0.09285308122803233),
 ('web', 0.07323749337563096),
 ('application', 0.0622930523123512),
 ('management', 0.0622930523123512),
 ('apache', 0.05032395169459188)]

What I want is a final dataframe in which one column contains the topic name and another column contains the elements of the topic.

expected outcome:

  class                         entities
0 http_tls_ssl tls_ssl           HTTP...etc
1 jboss_console_web_application  JBoss, console, etc

I also want a dataframe with the topic names as separate columns:

  http_tls_ssl tls_ssl           jboss_console_web_application
0 http                           JBoss
1 tls                            console
2 etc                            etc

I have no idea how to do this. Is there a way?

Lau*_*ent 3

Here is one way to do it:

Setup

import pandas as pd
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

topic_model = BERTopic()
# To keep the example reproducible in a reasonable time, limit to 3,000 docs
topics, probs = topic_model.fit_transform(docs[:3_000])

df = topic_model.get_topic_info()
print(df)
# Output
   Topic  Count                    Name
0     -1     23         -1_the_of_in_to
1      0   2635         0_the_to_of_and
2      1    114          1_the_he_to_in
3      2    103         2_the_to_in_and
4      3     59           3_ditto_was__
5      4     34  4_pool_andy_table_tell
6      5     32       5_the_to_game_and

First dataframe

Using Pandas string methods:

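df = (
    df.rename(columns={"Name": "class"})
    .drop(columns=["Topic", "Count"])
    .reset_index(drop=True)
)

# Keep only the word from each (word, score) pair; empty strings become <NA>.
# This relies on get_topics() and get_topic_info() listing topics in the same order.
df["entities"] = [
    [word if word else pd.NA for word, _score in words]
    for words in topic_model.get_topics().values()
]

# Remove the -1 (outlier) topic; copy so the assignment below acts on a real frame
df = df.loc[~df["class"].str.startswith("-1"), :].copy()

# Remove the numeric prefix ('0_', '1_', ...) from the topic names
df["class"] = df["class"].replace(r"^-?\d+_", "", regex=True)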
print(df)
# Output
                  class                                                      entities
1         the_to_of_and                [the, to, of, and, is, in, that, it, for, you]
2          the_he_to_in               [the, he, to, in, and, that, is, of, his, year]
3         the_to_in_and             [the, to, in, and, of, he, team, that, was, game]
4           ditto_was__  [ditto, was, <NA>, <NA>, <NA>, <NA>, <NA>, <NA>, <NA>, <NA>]
5  pool_andy_table_tell  [pool, andy, table, tell, us, well, your, about, <NA>, <NA>]
6       the_to_game_and           [the, to, game, and, games, espn, on, in, is, have]
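
If you would rather not depend on `get_topics()` and `get_topic_info()` returning topics in the same order, the words can be looked up by topic ID instead. The following is only a sketch: `build_class_entities` is a hypothetical helper, and `topic_info` and `topic_words` are hand-made stand-ins shaped like the two BERTopic return values (names taken from the question, scores rounded, the -1 scores are placeholders). Unlike the code above, it drops empty words rather than replacing them with `<NA>`.

```python
import pandas as pd

# Stand-in shaped like topic_model.get_topic_info() (illustrative, not real output)
topic_info = pd.DataFrame(
    {
        "Topic": [-1, 0, 1],
        "Count": [1, 14, 8],
        "Name": [
            "-1_default_greenbone_gmp_manager",
            "0_http_tls_ssl tls_ssl",
            "1_jboss_console_web_application",
        ],
    }
)

# Stand-in shaped like topic_model.get_topics(): {topic_id: [(word, score), ...]}
topic_words = {
    -1: [("default", 0.0), ("greenbone", 0.0)],  # placeholder scores
    0: [("http", 0.086), ("tls", 0.062), ("ssl tls", 0.062)],
    1: [("jboss", 0.140), ("console", 0.093), ("web", 0.073)],
}


def build_class_entities(topic_info, topic_words):
    """Hypothetical helper: one row per topic with its cleaned name and word list."""
    # Drop the -1 (outlier) topic and work on a copy
    out = topic_info.loc[topic_info["Topic"] != -1, ["Topic", "Name"]].copy()
    # Look the words up by topic ID, skipping empty strings
    out["entities"] = out["Topic"].map(
        lambda topic_id: [word for word, _score in topic_words[topic_id] if word]
    )
    # Strip the numeric prefix ('0_', '1_', ...) from the topic name
    out["class"] = out["Name"].str.replace(r"^-?\d+_", "", regex=True)
    return out[["class", "entities"]].reset_index(drop=True)


result = build_class_entities(topic_info, topic_words)
print(result)
```

With a fitted model, the call would presumably be `build_class_entities(topic_model.get_topic_info(), topic_model.get_topics())`.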

Second dataframe

Using Pandas transpose:
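
The `other_df` printed below can be obtained by expanding the `entities` lists into columns, labelling each row with its `class`, and transposing. This is a minimal sketch of that idea, not necessarily the exact code behind the output shown; the `df` here is a shortened, hand-built stand-in for the first dataframe.

```python
import pandas as pd

# Shortened stand-in for the first dataframe (word lists truncated for brevity)
df = pd.DataFrame(
    {
        "class": ["ditto_was__", "pool_andy_table_tell"],
        "entities": [
            ["ditto", "was", pd.NA, pd.NA],
            ["pool", "andy", "table", "tell"],
        ],
    }
)

# Build one row per topic and one column per word position,
# label the rows with the topic names, then transpose so that
# each topic becomes a column
other_df = pd.DataFrame(
    df["entities"].tolist(), index=df["class"].to_numpy()
).T

print(other_df)
```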

print(other_df)
# Output
  the_to_of_and the_he_to_in the_to_in_and ditto_was__ pool_andy_table_tell the_to_game_and
0           the          the           the       ditto                 pool             the
1            to           he            to         was                 andy              to
2            of           to            in        <NA>                table            game
3           and           in           and        <NA>                 tell             and
4            is          and            of        <NA>                   us           games
5            in         that            he        <NA>                 well            espn
6          that           is          team        <NA>                 your              on
7            it           of          that        <NA>                about              in
8           for          his           was        <NA>                 <NA>              is
9           you         year          game        <NA>                 <NA>            have