Is this a good word embedding model for text analysis?

  • Thread starter: Trollfaz
SUMMARY

The discussion centers on the use of a character-based model for text analysis, specifically utilizing K clustering and RNNs for embedding text into vectors. The proposed method involves tokenizing text into characters, one-hot encoding them, and processing them through an RNN with a transformation matrix W. However, it is clarified that this approach does not constitute a word embedding model, as it lacks traditional word tokenization. The main concern raised is the initialization and training of the transformation matrix W.

PREREQUISITES
  • Understanding of K clustering techniques
  • Familiarity with Recurrent Neural Networks (RNNs)
  • Knowledge of one-hot encoding methods
  • Basic concepts of text tokenization
NEXT STEPS
  • Research K clustering algorithms for text topic grouping
  • Learn about RNN architecture and training techniques
  • Explore one-hot encoding and its applications in NLP
  • Study various text tokenization methods and their implications
USEFUL FOR

Data scientists, machine learning engineers, and NLP practitioners looking to implement text analysis techniques using character-based models and RNNs.

Trollfaz
I am trying to build an AI to group texts into topics using K clustering, so I must embed each text as a vector. Everyone knows that the conventional way is to first tokenize the text into words and stripping all the punctuation chars. I get a list of words from this. This is what I plan to do with each word:
1) Convert it to lower case.
2) One-hot encode all characters, so the inputs are 26-dimensional vectors/arrays.
3) Feed the arrays/vectors into an RNN sequentially so that the sequence of the characters is known.
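A minimal sketch of steps 1 and 2, assuming a plain NumPy representation (the helper name `one_hot_chars` is my own, not from the thread):

```python
import numpy as np

def one_hot_chars(word: str) -> np.ndarray:
    """Lower-case a word and one-hot encode each character as a 26-dim vector."""
    word = word.lower()
    vecs = np.zeros((len(word), 26))
    for i, ch in enumerate(word):
        idx = ord(ch) - ord('a')
        if 0 <= idx < 26:          # skip anything outside a-z
            vecs[i, idx] = 1.0
    return vecs

print(one_hot_chars("Cat").shape)  # (3, 26)
```

Each row is then one input ##c_n## for the recurrence described below in the post.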
Okay, this is the mathematical description. Before processing the nth character, the RNN is in state ##s_{n-1}##. The encoded vector of the nth character is ##c_n##.
$$s_0=0$$
$$s_n=s_{n-1}+Wc_n$$
Here ##W## is a 26 by 26 transformation matrix. Return ##s_n## once all characters have been processed. If one wants to emphasize the significance of capital letters, one can assign a value of 2 instead of 1 at the one-hot index for that character.
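The recurrence can be sketched directly; here `W` is filled with random values purely as a placeholder (whether it should instead be trained is exactly the open question), and `embed_word` is my own name for the procedure:

```python
import numpy as np

def embed_word(char_vecs: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Apply s_0 = 0, s_n = s_{n-1} + W c_n over a (length, 26) array of one-hot rows."""
    s = np.zeros(W.shape[0])
    for c in char_vecs:
        s = s + W @ c      # the proposed state update
    return s

rng = np.random.default_rng(0)
W = rng.standard_normal((26, 26))   # placeholder: random 26x26 transformation
chars = np.eye(26)[[2, 0, 19]]      # one-hot rows for 'c', 'a', 't'
print(embed_word(chars, W).shape)   # (26,)
```

One caveat the sketch makes visible: because the update is purely additive, the result equals ##W \sum_n c_n##, so this particular recurrence is invariant to character order even though the characters are fed in sequentially.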
But my main problem is how to get ##W##: can it be random, or must the values be adjusted through training?
Trollfaz said:
Is this a good word embedding model for text analysis?
No, it's not a word embedding model at all - your model is character based, not word based.

Trollfaz said:
Everyone knows that the conventional way is to first tokenize the text into words and stripping all the punctuation chars.
Does everyone know that? I think I could probably find an exception.

Trollfaz said:
This is what I plan to do with each word...
There is no tokenization of words in the process you describe. There are many descriptions of tokenization on the interweb e.g. https://medium.com/@james.moody/preparing-text-for-rnn-language-modeling-in-python-6474c3eae66e
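For contrast with the character-level scheme above, word tokenization in the sense the linked article discusses might look like this minimal regex sketch (the pattern and function name are my own illustration, not code from the article):

```python
import re

def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

print(tokenize("Everyone knows, right?"))  # ['everyone', 'knows', 'right']
```

A word embedding model would then map each of these tokens (not each character) to a vector.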
 
