Is this a good word embedding model for text analysis?

Summary:
The discussion revolves around building an AI model to group texts into topics using K-means clustering, focusing on embedding text into vectors. The proposed method involves tokenizing text into words, converting them to lowercase, and one-hot encoding characters into 26-dimensional vectors. The vectors are then processed sequentially through a Recurrent Neural Network (RNN) to maintain character sequence. The mathematical framework outlines how the RNN state updates with each character input, using a transformation matrix W. A key question raised is whether W can be initialized randomly or needs to be adjusted through training. However, it is pointed out that the approach is character-based rather than word-based, so it does not constitute a traditional word embedding technique. Additionally, the validity of the initial assumption about tokenization is questioned, since the proposed method performs no actual word tokenization.
Trollfaz
I am trying to build an AI to group texts into topics using K-means clustering, so I must embed each text as a vector. Everyone knows that the conventional way is to first tokenize the text into words and strip all the punctuation chars. I get a list of words from this. This is what I plan to do with each word:
1) Convert it to lowercase.
2) One-hot encode each character, so the inputs will be 26-dimensional vectors/arrays.
3) Feed the arrays/vectors into an RNN sequentially, so that the sequence of the characters is known.
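As a concrete illustration, steps 1 and 2 can be sketched in Python like this (a minimal sketch only; the helper name `one_hot_word` is my own, and it assumes the word contains only letters a-z after lowercasing):

```python
# Minimal sketch of steps 1-2: lowercase a word and one-hot encode
# each character into a 26-dimensional vector (letters a-z only).
import numpy as np

def one_hot_word(word):
    """Return an array of shape (len(word), 26), one row per character."""
    word = word.lower()                    # step 1: convert to lowercase
    vecs = np.zeros((len(word), 26))
    for i, ch in enumerate(word):
        vecs[i, ord(ch) - ord('a')] = 1.0  # step 2: one-hot encode
    return vecs

print(one_hot_word("Cat").shape)  # (3, 26)
```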
Okay, here is the mathematical description. Before processing the nth char, the RNN is at state ##s_{n-1}##. The encoded vector of the nth char is ##c_n##.
$$s_0=0$$
$$s_n=s_{n-1}+Wc_n$$
Here W is a 26 by 26 transformation matrix. Return ##s_n## when all chars have been processed. If one wants to emphasize the significance of capital letters, we can assign a value of 2 instead of 1 at the index of the one-hot encoded vector for that char.
But my main problem is how to get W: can it be random, or must the values be adjusted through training?
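For what it's worth, the recurrence above can be sketched with a random W (the function names and seed are my own; note that since the update is a plain sum, the final state equals ##W## applied to the sum of the one-hot vectors, so character order is not actually preserved by these equations):

```python
# Sketch of the recurrence s_n = s_{n-1} + W c_n with a random 26x26 W.
# Because the update is a plain sum, the final state equals
# W @ (sum of one-hot vectors), so character order is lost.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((26, 26))  # 26 by 26 transformation matrix

def embed(word, W):
    s = np.zeros(26)                 # s_0 = 0
    for ch in word.lower():
        c = np.zeros(26)
        c[ord(ch) - ord('a')] = 1.0  # one-hot vector c_n
        s = s + W @ c                # s_n = s_{n-1} + W c_n
    return s                         # return s_n after the last char

# Anagrams map to the same vector, whatever W is:
print(np.allclose(embed("listen", W), embed("silent", W)))  # True
```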
 
Trollfaz said:
Is this a good word embedding model for text analysis?
No, it's not a word embedding model at all - your model is character based, not word based.

Trollfaz said:
Everyone knows that the conventional way is to first tokenize the text into words and strip all the punctuation chars.
Does everyone know that? I think I could probably find an exception.

Trollfaz said:
This is what I plan to do with each word...
There is no tokenization of words in the process you describe. There are many descriptions of tokenization on the interweb, e.g. https://medium.com/@james.moody/preparing-text-for-rnn-language-modeling-in-python-6474c3eae66e
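For contrast, here is a minimal word tokenization sketch (regex-based; this is an illustration of the general idea, not the exact approach from the linked article, and the name `tokenize` is my own):

```python
# Minimal word tokenization sketch: lowercase the text and extract
# runs of letters, which strips punctuation and digits in one pass.
import re

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

print(tokenize("Hello, world! It's K-means."))
# ['hello', 'world', 'it', 's', 'k', 'means']
```

Note that this crude rule splits contractions like "it's" into two tokens; real tokenizers handle such cases more carefully.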
 