Algorithm for Converting Strings: Find the Best Option

In summary: The original poster tried re.sub("\w+", my_replace, input) with a placeholder my_replace function but could not make it work. The suggested approach is to pass a real replacement function to re.sub.
  • #1
Arman777
Let us suppose I have a string in this form,

string = 'CıC CkCnow CwCho CyCou CaCre but CwChat CaCm CıC'

Now I want to take the text between each pair of 'C's and convert it to upper case, dropping the 'C's. For example, the above string should turn into

new_string = 'I Know Who You Are but What Am I'

What kind of algorithm is best for this job? I have come up with something, but it seems really long and inefficient. Any ideas?
 
  • #4
Arman777 said:
I have come up with something but it seems really long and inefficient.
It's going to be hard for us to tell whether or not we agree with you if we don't see the algorithm.
 
  • #5
The 'sub' function in Python allows you to have it call a function for each string that matches the pattern and return the string that you want to be substituted. That is what you want so that you can replace 'CxC' with 'X'. See this for a description.
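
A minimal sketch of that mechanism on a generic example, not the thread's problem. The function name `double_number` and the sample text are made up for illustration; the only library call used is the standard `re.sub(pattern, repl, string)` with a callable as `repl`.

```python
import re

def double_number(match):
    # re.sub calls this once per match; match.group(0) is the matched text.
    # The returned string replaces that match in the output.
    return str(int(match.group(0)) * 2)

text = "3 apples and 10 pears"
result = re.sub(r"\d+", double_number, text)
print(result)  # 6 apples and 20 pears
```

The same shape should carry over to the 'CxC' case: the pattern decides what gets matched, and the function decides what each match turns into.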
 
  • Like
Likes sysprog and Arman777
  • #6
PeterDonis said:
It's going to be hard for us to tell whether or not we agree with you if we don't see the algorithm.
My algorithm was to take the index of each `C` letter, pair the indices up two at a time, and then uppercase the characters between each pair of indices. However, it is taking too long...
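
The approach described above could be sketched roughly as follows. This is not the poster's actual code, which was never shared; the function name `upper_between_markers`, the `marker` parameter, and raising `ValueError` on an unpaired marker are all assumptions.

```python
def upper_between_markers(text, marker="C"):
    # Index of every marker character in the text.
    positions = [i for i, ch in enumerate(text) if ch == marker]
    if len(positions) % 2:
        # Assumption: an odd number of markers is treated as an error.
        raise ValueError("unpaired marker")
    pieces = []
    cursor = 0
    # Pair the indices as (open, close), (open, close), ...
    for start, end in zip(positions[0::2], positions[1::2]):
        pieces.append(text[cursor:start])            # text before the pair, unchanged
        pieces.append(text[start + 1:end].upper())   # text between the pair, uppercased
        cursor = end + 1                             # skip past the closing marker
    pieces.append(text[cursor:])                     # whatever follows the last pair
    return "".join(pieces)

print(upper_between_markers("CaCarmanCpopC"))  # AarmanPOP
print(upper_between_markers("CaCCvaC"))        # AVA
```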
 
  • #7
FactChecker said:
The 'sub' function in Python allows you to have it call a function for each string that matches the pattern and return the string that you want to be substituted. That is what you want so that you can replace 'CxC' with 'X'. See this for a description.
It looks good. I'll try this.
 
  • #8
regex 2021.8.3 is a python package that you can install and use.
 
  • Like
Likes sysprog
  • #9
The hardest part, in general, would be to distinguish between a 'CxC' pattern that you want to replace versus an acronym with two 'C's that should stay as is. IMO, it is a mistake to use a normal ASCII character like 'C' as a special non-ASCII indicator with a special meaning.
(Also document section headers are sometimes all capitalized and might have character patterns that you don't want to replace.)
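
A small illustration of that risk, assuming a pattern of the form `C(\w+?)C` and two made-up inputs that already contain capital letters. The names `to_upper`, `acronym`, and `heading` are illustrative only.

```python
import re

def to_upper(match):
    # Uppercase whatever was captured between the two 'C's.
    return match.group(1).upper()

pattern = r"C(\w+?)C"

# An acronym with two 'C's is mistaken for a marker pair.
acronym = re.sub(pattern, to_upper, "the CDC report")
# An all-caps heading is damaged wherever two 'C's share a word.
heading = re.sub(pattern, to_upper, "CHAPTER ONE: CIRCUITS")

print(acronym)  # the D report
print(heading)  # CHAPTER ONE: IRUITS
```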
 
  • Like
Likes jack action and sysprog
  • #10
Python is not known for speed. I assume that your algorithm is fairly simple and that Python is just slow. I think there are ways to pre-compile Python so that it will be faster. If this program is to be used for large quantities of text processing, you might want to do that part with a separate program in a faster language. If this is to be used for large text documents, you might be surprised at how many things occur in text documents that require more logic than you anticipated.
 
  • #11
FactChecker said:
The hardest part, in general, would be to distinguish between a 'CxC' pattern that you want to replace versus an acronym with two 'C's that should stay as is. IMO, it is a mistake to use a normal ASCII character like 'C' as a special non-ASCII indicator with a special meaning.
(Also document section headers are sometimes all capitalized and might have character patterns that you don't want to replace.)
That's no problem in my case. All the strings that I will work with are lowercase.
 
  • #12
FactChecker said:
Python is not known for speed. I assume that your algorithm is fairly simple and that Python is just slow. I think there are ways to pre-compile Python so that it will be faster. If this program is to be used for large quantities of text processing, you might want to do that part with a separate program in a faster language. If this is to be used for large text documents, you might be surprised at how many things occur in text documents that require more logic than you anticipated.
Well, I don't know any language other than Python, and I kind of need it to work in Python. But the text will not be much longer than this, so it won't be a problem.
 
  • #13
I tried to use
Code:
def my_replace(m):
    if <some condition>:
        return <replacement variant 1>
    return <replacement variant 2>

result = re.sub("\w+", my_replace, input)

but I couldn't make it work. Any ideas?
 
  • #14
Arman777 said:
I tried to use
Code:
def my_replace(m):
    if <some condition>:
        return <replacement variant 1>
    return <replacement variant 2>

result = re.sub("\w+", my_replace, input)

but I couldn't make it work. Any ideas?
This is just pseudocode. You need to replace it with real Python code appropriate for your problem. Exactly what code did you try?
 
  • #15
FactChecker said:
This is just pseudocode. You need to replace it with real Python code appropriate for your problem.
Yes, indeed. I just don't know how to use the re module. It says it can take a function, but I am not sure how to use that function.
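
One hypothetical way to fill in the placeholders from the pseudocode, just to show how the pieces fit. The condition (word length greater than 3) and both replacement variants are invented for illustration and are not the thread's actual requirement.

```python
import re

def my_replace(m):
    # m is a match object; m.group(0) is the text matched by the pattern.
    word = m.group(0)
    if len(word) > 3:        # hypothetical condition
        return word.upper()  # replacement variant 1
    return word              # replacement variant 2: leave the word unchanged

result = re.sub(r"\w+", my_replace, "who are you really")
print(result)  # who are you REALLY
```

The function must always return a string, since whatever it returns is spliced into the result in place of the match.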
FactChecker said:
Exactly what code did you try?
Not worth sharing since it's not useful.
 
  • #16
I'm starting to suspect that this is a Python homework problem because it seems very artificial. In that case, I will only give hints on how to modify your Python code.

In case it is not a Python homework problem, below is some Perl code that will work. Put the original text in the file temp2.txt and it will print the modified result to STDOUT.

Perl:
$string = `type temp2.txt`;
$string =~ s/(C(\w)C)/uc($2)/ge;
print "$string\n";
 
  • #17
This is what you want:

Python:
import re

def to_camel_case(match):
    if match.group(1) is not None:
        return match.group(1).upper()

old_str = 'CıC CkCnow CwCho CyCou CaCre but CwChat CaCm CıC'
new_str = re.sub(r"C([^C])C", to_camel_case, old_str)

print(new_str)

I will leave it as an exercise for you to understand how it works.

But this is a much more useful (and fun!) use of regular expressions and python (note that there are no 'C' in the original string):

Python:
import re

def to_camel_case(match):
    if match.group(2) is not None:
        if match.group(2) not in ['but', 'and', 'of']:
            return  match.group(1) + match.group(3).upper() + match.group(4)
        else:
            return match.group(1) + match.group(2)

old_str = 'ı know who you are but what am ı'
new_str = re.sub(r"(^|[\s.,;:!?()])(([^\s.,;:!?()])([^\s.,;:!?()]*))(?=$|[\s.,;:!?()])", to_camel_case, old_str)

print(new_str)

If I were more fluent in Python, I could write a better regular expression than that, but re seems to support a limited regex dialect. The best tool for learning about regular expressions is regex101.com.
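
A possibly simpler variant of the second example, assuming it is acceptable to treat every run of `\w` characters as a word. The names `SMALL_WORDS` and `capitalize_word` are illustrative, and this is a sketch rather than a drop-in replacement for the pattern above.

```python
import re

# Words that stay lowercase, taken from the list in the example above.
SMALL_WORDS = {"but", "and", "of"}

def capitalize_word(match):
    word = match.group(0)
    if word in SMALL_WORDS:
        return word
    # Uppercase only the first character and keep the rest as-is.
    return word[0].upper() + word[1:]

old_str = "ı know who you are but what am ı"
new_str = re.sub(r"\w+", capitalize_word, old_str)
print(new_str)  # I Know Who You Are but What Am I
```

The trade-off is that `\w+` does not match apostrophes or hyphens, so a word such as "don't" would be split at the apostrophe, whereas the separator-based pattern above keeps it whole.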
 
  • Like
Likes FactChecker and Arman777
  • #18
FactChecker said:
I'm starting to suspect that this is a Python homework problem because it seems very artificial.
Well, I am going to use it somewhere, but it's not homework.
FactChecker said:
In case it is not a Python homework problem, below is some Perl code that will work. Put the original text in the file temp2.txt and it will print the modified result to STDOUT.
I did not ask for Perl code. I don't know Perl or how to run it.
 
  • #19
jack action said:
This is what you want:

Python:
import re

def to_camel_case(match):
    if match.group(1) is not None:
        return match.group(1).upper()

old_str = 'CıC CkCnow CwCho CyCou CaCre but CwChat CaCm CıC'
new_str = re.sub(r"C([^C])C", to_camel_case, old_str)

print(new_str)

I will leave it as an exercise for you to understand how it works.

But this is a much more useful (and fun!) use of regular expressions and python (note that there are no 'C' in the original string):

Python:
import re

def to_camel_case(match):
    if match.group(2) is not None:
        if match.group(2) not in ['but', 'and', 'of']:
            return  match.group(1) + match.group(3).upper() + match.group(4)
        else:
            return match.group(1) + match.group(2)

old_str = 'ı know who you are but what am ı'
new_str = re.sub(r"(^|[\s.,;:!?()])(([^\s.,;:!?()])([^\s.,;:!?()]*))(?=$|[\s.,;:!?()])", to_camel_case, old_str)

print(new_str)

If I was more fluent in python, I could make a better regular expression than that, but re seems to use a limited version. The best tool to learn about regular expression is regex101.com.
It's nice, but it does not work for this case:

xstr = 'CaCarmanCpopC'

It should have produced

xstr = 'AarmanPOP'

but instead it produces

AarmanCpopC

so it ignores other C values.
 
  • #20
Arman777 said:
My algorithm was to take the index of each `C` letter. Pair them as 2. Get index values between them and then turn strings into uppercase letters based on these index.
That's basically what the regex version is doing.

Arman777 said:
However, it is taking too long...
That's because Python is doing it in bytecode instructions, whereas the regex version is using the underlying C implementation for regular expressions, which will be a lot faster. But the algorithm itself is basically the same either way. There's no magical shortcut to finding the "C"s and uppercasing the letters between them.
 
  • #21
FactChecker said:
I think there are ways to pre-compile Python so that it will be faster.
If you're running Python bytecode, you're running Python bytecode. "Pre-compiling", for Python, just means compiling Python source code to bytecode in advance. That won't make much difference compared to the overhead of bytecode while actually running the algorithm.

There is the option of trying other interpreters, such as PyPy, that use various tricks to optimize how Python bytecode is run. For this problem, with a short string, that probably won't do much; but for a very large body of text, it might since the PyPy optimizer will have more opportunities to optimize.
 
  • #22
PeterDonis said:
That's basically what the regex version is doing.
But my code takes more than 10-20 lines; the regex might take 5 lines, maybe less.

PeterDonis said:
That's because Python is doing it in bytecode instructions, whereas the regex version is using the underlying C implementation for regular expressions, which will be a lot faster. But the algorithm itself is basically the same either way. There's no magical shortcut to finding the "C"s and uppercasing the letters between them.
I did not mean in terms of speed of the running time but in terms of me writing the code :)

Code:
re.sub(r'C(\w)C', lambda s: s.group(1).upper(), xstr)
This seems to be working, but it fails if there is more than one letter between the C's. For example,

For

xstr = 'CaCCvaC'

the above code produces

a = 'ACvaC'

but it should produce AVA.
 
  • #23
Arman777 said:
It's nice, but it does not work for this case:

xstr = 'CaCarmanCpopC'

It should have produced

xstr = 'AarmanPOP'

but instead it produces

AarmanCpopC

so it ignores other C values.
Your example didn't have anything with multiple letters between the 'C's.
In line 8 try new_str = re.sub(r"C([^C]+)C", to_camel_case, old_str)
or new_str = re.sub(r"C(\w+)C", to_camel_case, old_str)
Unfortunately, this will be fooled by any pair of 'C's that are part of real words. So it is most useful if there are no capital letters in the real text.
You may need to get familiar with Python regular expressions and try some things to get it to work the way you want it to.
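
The two suggested patterns do not behave the same way when a string contains more than one pair of 'C's, because `\w` also matches 'C' and `+` is greedy. A quick check on the input from post #19 (the names `to_upper`, `negated`, and `greedy` are illustrative):

```python
import re

def to_upper(match):
    # Uppercase whatever was captured between the two 'C's.
    return match.group(1).upper()

xstr = "CaCarmanCpopC"

# [^C]+ can never run across a 'C', so each pair is matched separately.
negated = re.sub(r"C([^C]+)C", to_upper, xstr)
# \w+ is greedy and also matches 'C', so it stretches from the first 'C' to the last.
greedy = re.sub(r"C(\w+)C", to_upper, xstr)

print(negated)  # AarmanPOP
print(greedy)   # ACARMANCPOP
```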
 
  • Like
Likes jack action
  • #24
FactChecker said:
Your example didn't have anything with multiple letters between the 'C's.
In line 8 try new_str = re.sub(r"C([^C]+)C", to_camel_case, old_str)
or new_str = re.sub(r"C(\w+)C", to_camel_case, old_str)
Unfortunately, this will be fooled by any pair of 'C's that are part of real words. So it is most useful if there are no capital letters in the real text.
You may need to get familiar with Python regular expressions and try some things to get it to work the way you want it to.
Guys, please. As I said earlier, in the text I am working on there are no capital letters, so there will be no uppercase C.
Arman777 said:
All strings that I will work are lowercase.

Code:
re.sub(r'C(\w+?)C', lambda s: s.group(1).upper(), xstr)

This code works

FactChecker said:
have anything with multiple letters between the 'C's.
CpopC was the case.

You guys are really helpful, but sometimes I just need some specific things. I know what I am doing. I know the difference between capital C and lowercase c and how the code can mix them up. Maybe I am 'new' to coding, but I know that much.

Arman777 said:
But my code takes more than 10-20 lines; the regex might take 5 lines, maybe less.
We have seen that it takes only 1 line :)
 
  • #25
Arman777 said:
my code takes more than 10-20 lines; the regex might take 5 lines, maybe less.
Yes, because the regex version already has built-in functions that perform the operations you need, so you don't have to code them by hand.

Arman777 said:
I did not mean in terms of speed of the running time but in terms of me writing the code :)
Yes, I agree that's important. I've found that a great source of innovation in coding is programmer laziness. :wink:
 
  • #26
anorlunda said:
regex 2021.8.3 is a python package that you can install and use.
Python already has the built-in re module in the standard library.
 
  • Like
Likes Arman777
  • #27
Arman777 said:
This code works
As long as you're sure the characters in between the C's will all be lower case letters, yes. You could also make the regex more specific for that:

Python:
re.sub('C([a-z]+)C', lambda s: s.group(1).upper(), xstr)

Also, as shown in the example above, if you're sure there will be at least one lower case letter in between each pair of C's, you don't need the question mark in the regex, just the plus sign.
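
A short comparison of the two patterns on inputs taken from this thread. One caveat worth checking: `[a-z]` covers only the ASCII letters a to z, so the dotless 'ı' from post #1 is not matched by it, while `\w` does match it. The names `to_upper`, `inputs`, and `results` are illustrative.

```python
import re

def to_upper(match):
    # Uppercase whatever was captured between the two 'C's.
    return match.group(1).upper()

inputs = ("CaCarmanCpopC", "CaCCvaC", "CıC CkCnow")
results = {}
for xstr in inputs:
    lazy = re.sub(r"C(\w+?)C", to_upper, xstr)          # lazy \w+? stops at the first 'C'
    ascii_only = re.sub(r"C([a-z]+)C", to_upper, xstr)  # [a-z] cannot match 'C' at all
    results[xstr] = (lazy, ascii_only)
    print(xstr, "->", lazy, "|", ascii_only)

# CaCarmanCpopC -> AarmanPOP | AarmanPOP
# CaCCvaC -> AVA | AVA
# CıC CkCnow -> I Know | CıC Know
```

If the text can contain non-ASCII lowercase letters, `[^C]+` or the lazy `\w+?` form may be the safer choice.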
 
  • Like
Likes Arman777
  • #28
Arman777 said:
Guys please. As I have said earlier. In the text I am working on there are no capital letters. So there will be no uppercase C.
Code:
re.sub('C(\w+?)C', lambda s: s.group(1).upper(), xstr)

This code works

CpopC was the case.

You guys are really helpful, but sometimes I just need some specific things. I know what I am doing. I know the difference between capital C and lowercase c and how the code can mix them up. Maybe I am 'new' to coding, but I know that much.

We have seen that it takes only 1 line :)
Sorry. You will get the best help if you are careful about the initial statement of the problem. The information about no capital letters and the example with more than one letter between the 'C's was not in the first post. It is hard for me to keep up with all the posts to get a clear picture of what is needed.
 
  • #29
Could just get your keyboard fixed.
 
