Is this a good word embedding model for text analysis?

Summary:
The discussion revolves around building an AI model to group texts into topics using K clustering, focusing on embedding text into vectors. The proposed method involves tokenizing text into words, converting them to lowercase, and one-hot encoding characters into a 26-dimensional vector. The vectors are then processed sequentially through a Recurrent Neural Network (RNN) to maintain character sequence. The mathematical framework outlines how the RNN state updates with each character input, using a transformation matrix W. A key question raised is whether W can be initialized randomly or needs to be adjusted through training. However, it is clarified that the approach is character-based rather than word-based, indicating that it does not utilize traditional word embedding techniques. Additionally, the validity of the initial assumption about tokenization is questioned, highlighting that the proposed method lacks actual word tokenization.
Trollfaz
I am trying to build an AI to group texts into topics using K-means clustering, so I must embed each text as a vector. Everyone knows that the conventional way is to first tokenize the text into words, stripping all the punctuation characters. I get a list of words from this. This is what I plan to do with each word:
1) Convert it to lowercase.
2) One-hot encode each character, so each input is a 26-dimensional vector/array.
3) Feed the vectors into an RNN sequentially, so that the order of the characters is known.
Okay, this is the mathematical description. Before processing the nth character, the RNN is in state ##s_{n-1}##. The one-hot encoded vector of the nth character is ##c_n##.
$$s_0=0$$
$$s_n=s_{n-1}+Wc_n$$
Here W is a 26 by 26 transformation matrix. Return ##s_n## once all characters have been processed. If one wants to emphasize the significance of capital letters, one could assign a value of 2 instead of 1 at the corresponding index of the one-hot encoded vector.
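For concreteness, the scheme described above can be sketched as follows. This is a hypothetical implementation, not code from the thread; the matrix W is drawn at random here only to make the sketch runnable, which is exactly the open question below.

```python
import numpy as np

def embed_word(word, W):
    """Sketch of the described scheme: one-hot encode each character
    ('a'-'z' -> 26-dim vector c_n) and accumulate s_n = s_{n-1} + W @ c_n.
    Returns s_n after the last character."""
    s = np.zeros(26)
    for ch in word.lower():
        idx = ord(ch) - ord('a')
        if 0 <= idx < 26:            # skip non-alphabetic characters
            c = np.zeros(26)
            c[idx] = 1.0             # one-hot encoding of the character
            s = s + W @ c            # the recurrence s_n = s_{n-1} + W c_n
    return s

# W initialized randomly, as a placeholder (assumed, not settled in the thread)
rng = np.random.default_rng(0)
W = rng.standard_normal((26, 26))
vec = embed_word("Hello", W)         # a 26-dimensional vector
```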
But my main problem is how to get W: can it be random, or must its values be adjusted through training?
 
Trollfaz said:
Is this a good word embedding model for text analysis?
No, it's not a word embedding model at all: your model is character-based, not word-based.

Trollfaz said:
Everyone knows that the conventional way is to first tokenize the text into words, stripping all the punctuation characters.
Does everyone know that? I think I could probably find an exception.

Trollfaz said:
This is what I plan to do with each word...
There is no tokenization into words in the process you describe. There are many descriptions of tokenization on the web, e.g. https://medium.com/@james.moody/preparing-text-for-rnn-language-modeling-in-python-6474c3eae66e
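To illustrate what word tokenization (as opposed to the character-level processing above) looks like, here is a minimal sketch, assuming a simple regex-based approach rather than anything from the linked article:

```python
import re

def tokenize(text):
    """Minimal word tokenizer: lowercase the text and extract
    runs of alphabetic characters, dropping punctuation and digits."""
    return re.findall(r"[a-z]+", text.lower())

tokenize("Hello, world! It's 2024.")   # ['hello', 'world', 'it', 's']
```

A real pipeline would typically use a library tokenizer that handles contractions and hyphenation properly, but the point stands: tokenization produces a list of word tokens before any encoding happens.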
 
