Is this a good word embedding model for text analysis?

  • Thread starter: Trollfaz
  • Tags: Text
AI Thread Summary
The discussion revolves around building an AI model to group texts into topics using K clustering, focusing on embedding text into vectors. The proposed method involves tokenizing text into words, converting them to lowercase, and one-hot encoding characters into a 26-dimensional vector. The vectors are then processed sequentially through a Recurrent Neural Network (RNN) to maintain character sequence. The mathematical framework outlines how the RNN state updates with each character input, using a transformation matrix W. A key question raised is whether W can be initialized randomly or needs to be adjusted through training. However, it is clarified that the approach is character-based rather than word-based, indicating that it does not utilize traditional word embedding techniques. Additionally, the validity of the initial assumption about tokenization is questioned, highlighting that the proposed method lacks actual word tokenization.
Trollfaz
I am trying to build an AI to group texts into topics using K clustering, so I must embed each text into a vector. Everyone knows that the conventional way is to first tokenize the text into words and strip all the punctuation chars. This gives me a list of words. This is what I plan to do with each word.
1) Convert it to lowercase.
2) One-hot encode each character, so each input is a 26-dim vector/array.
3) Feed the arrays/vectors into an RNN sequentially so that the sequence of the characters is known.
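A minimal sketch of the tokenizing step and steps 1 and 2, assuming plain Python and only the 26 letters a-z. The helper names `tokenize`, `one_hot` and `encode_word` are hypothetical and used only for illustration; digits, punctuation and non-ASCII letters are simply dropped here, which is an assumption rather than something the post specifies.

```python
import re
import string

ALPHABET = string.ascii_lowercase  # assumption: only the 26 letters a-z are encoded


def tokenize(text):
    """Lowercase the text and split it into runs of a-z, dropping everything else."""
    return re.findall(r"[a-z]+", text.lower())


def one_hot(char):
    """Return a 26-element list with a 1 at the index of `char` and 0 elsewhere."""
    vec = [0] * len(ALPHABET)
    vec[ALPHABET.index(char)] = 1
    return vec


def encode_word(word):
    """Encode a word as a list of one-hot vectors, one per character, in order."""
    return [one_hot(c) for c in word]


words = tokenize("Hello, world!")
print(words)                     # ['hello', 'world']
print(encode_word(words[0])[0])  # one-hot vector for 'h'
```

Note that this simple regular expression splits a word like "don't" into two tokens; a real tokenizer would need a rule for such cases.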
Okay, this is the mathematical description. Before processing the nth char, the RNN is in state ##s_{n-1}##. The encoded vector of the nth char is ##c_n##.
$$s_0=0$$
$$s_n=s_{n-1}+Wc_n$$
Here W is a 26 by 26 transformation matrix. Return ##s_n## once all chars have been processed. To emphasize the significance of capital letters, one can assign a value of 2 instead of 1 at that char's index in the one-hot encoded vector.
But my main problem is how to get W: can it be random, or must the values be adjusted through training?
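The recurrence above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: W is drawn uniformly from [-1, 1] with a fixed seed (the post does not fix any initialization), and `random_matrix` and `embed_word` are hypothetical helper names.

```python
import random
import string

ALPHABET = string.ascii_lowercase
DIM = len(ALPHABET)  # 26


def random_matrix(dim, seed=0):
    """Assumption: entries of W are uniform in [-1, 1]; the post leaves this open."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]


def embed_word(word, W):
    """Apply s_0 = 0, s_n = s_{n-1} + W c_n and return the final state."""
    state = [0.0] * DIM
    for char in word.lower():
        j = ALPHABET.index(char)  # c_n is one-hot at index j
        # For a one-hot c_n, the product W c_n is just column j of W.
        for i in range(DIM):
            state[i] += W[i][j]
    return state


W = random_matrix(DIM)
a = embed_word("listen", W)
b = embed_word("silent", W)
# Largest componentwise difference between the states of two anagrams
print(max(abs(x - y) for x, y in zip(a, b)))
```

Because the update only adds ##Wc_n## to the previous state, unrolling it gives ##s_n = W\sum_{i=1}^{n} c_i##, so the final state depends only on how many times each letter occurs. The two anagrams above therefore end in the same state up to floating-point rounding, whatever values W takes.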
 
Trollfaz said:
Is this a good word embedding model for text analysis?
No, it's not a word embedding model at all - your model is character based, not word based.

Trollfaz said:
Everyone knows that the conventional way is to first tokenize the text into words and strip all the punctuation chars.
Does everyone know that? I think I could probably find an exception.

Trollfaz said:
This is what I plan to do with each word...
There is no tokenization of words in the process you describe. There are many descriptions of tokenization on the interweb, e.g. https://medium.com/@james.moody/preparing-text-for-rnn-language-modeling-in-python-6474c3eae66e
 