String Operations - Transforming lower case letters to upper case between certain symbols

  • Python
  • Thread starter Arman777
  • Start date
  • #1
2,142
183
Let us suppose I have a string in this form,

string = 'CıCCkCnow CwCho CyCou CaCre but CwChat CaCm CıC'

Now I want to take each word between 'C' and convert it into upper case letters. For example, the above string should turn into

new_string = 'I know Who You Are but What Am I'

What kind of algorithm is best for this job? I have come up with something, but it seems really long and inefficient. Any ideas?
 

Answers and Replies

  • #4
PeterDonis
Mentor
Insights Author
2020 Award
35,623
13,803
I have come up with something but it seems really long and inefficient.
It's going to be hard for us to tell whether or not we agree with you if we don't see the algorithm.
 
  • #5
FactChecker
Science Advisor
Gold Member
6,576
2,644
The 'sub' function in Python allows you to have it call a function for each string that matches the pattern and return the string that you want to be substituted. That is what you want so that you can replace 'CxC' with 'X'. See this for a description.
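A minimal sketch of what passing a function to `re.sub` can look like, assuming exactly one character sits between the two 'C' markers. The function name `uppercase_marked_letter` and the sample text are made up for illustration.

```python
import re

def uppercase_marked_letter(match):
    # re.sub calls this once per match and inserts the returned string.
    # match.group(0) is the whole 'CxC' text; match.group(1) is only the 'x'.
    return match.group(1).upper()

text = "CaCbc CdCef"
result = re.sub(r"C(.)C", uppercase_marked_letter, text)
print(result)  # Abc Def
```

The parentheses in the pattern form a capture group, which is how the callback gets at the letter without the surrounding markers.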
 
  • #6
2,142
183
It's going to be hard for us to tell whether or not we agree with you if we don't see the algorithm.
My algorithm was to take the index of each `C` letter, pair the indices up two at a time, get the index values between each pair, and then turn those characters into uppercase letters based on these indices. However, it is taking too long...
 
  • #7
2,142
183
The 'sub' function in Python allows you to have it call a function for each string that matches the pattern and return the string that you want to be substituted. That is what you want so that you can replace 'CxC' with 'X'. See this for a description.
It looks good. I'll try this.
 
  • #8
anorlunda
Staff Emeritus
Insights Author
9,821
6,940
regex 2021.8.3 is a python package that you can install and use.
 
  • #9
FactChecker
Science Advisor
Gold Member
6,576
2,644
The hardest part, in general, would be to distinguish between a 'CxC' pattern that you want to replace versus an acronym with two 'C's that should stay as is. IMO, it is a mistake to use a normal ASCII character like 'C' as a special non-ASCII indicator with a special meaning.
(Also document section headers are sometimes all capitalized and might have character patterns that you don't want to replace.)
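The ambiguity can be seen in a small sketch. The helper `upper_between` and the choice of the ASCII unit separator (`"\x1f"`) as an alternative marker are assumptions made for illustration, not something proposed in the thread.

```python
import re

def upper_between(text, marker):
    # Build 'marker, one or more non-marker characters, marker' as a pattern.
    m = re.escape(marker)
    pattern = f"{m}([^{m}]+){m}"
    return re.sub(pattern, lambda match: match.group(1).upper(), text)

# With 'C' as the marker, the acronym CDC is mistaken for marked-up text.
print(upper_between("the CDC said CyCes", "C"))            # the D said Yes

# With a marker that never occurs in ordinary text, the acronym survives.
print(upper_between("the CDC said \x1fy\x1fes", "\x1f"))   # the CDC said Yes
```

If the input is guaranteed to be all lowercase, as in the original poster's case, the first form is safe as well.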
 
  • #10
FactChecker
Science Advisor
Gold Member
6,576
2,644
Python is not known for speed. I assume that your algorithm is fairly simple and that Python is just slow. I think there are ways to pre-compile Python so that it will be faster. If this program is to be used for large quantities of text processing, you might want to do that part with a separate program in a faster language. If this is to be used for large text documents, you might be surprised at how many things occur in text documents that require more logic than you anticipated.
 
  • #11
2,142
183
The hardest part, in general, would be to distinguish between a 'CxC' pattern that you want to replace versus an acronym with two 'C's that should stay as is. IMO, it is a mistake to use a normal ASCII character like 'C' as a special non-ASCII indicator with a special meaning.
(Also document section headers are sometimes all capitalized and might have character patterns that you don't want to replace.)
That's no problem in my case. All the strings that I will work with are lowercase.
 
  • #12
2,142
183
Python is not known for speed. I assume that your algorithm is fairly simple and that Python is just slow. I think there are ways to pre-compile Python so that it will be faster. If this program is to be used for large quantities of text processing, you might want to do that part with a separate program in a faster language. If this is to be used for large text documents, you might be surprised at how many things occur in text documents that require more logic than you anticipated.
Well, I don't know any language other than Python, and I kind of need it to work in Python. But the text will not be much longer than this, so it won't be a problem.
 
  • #13
2,142
183
I tried to use


Code:
def my_replace(m):
    if <some condition>:
        return <replacement variant 1>
    return <replacement variant 2>

result = re.sub("\w+", my_replace, input)

but I couldn't make it work. Any ideas?
 
  • #14
FactChecker
Science Advisor
Gold Member
6,576
2,644
I tried to use


Code:
def my_replace(m):
    if <some condition>:
        return <replacement variant 1>
    return <replacement variant 2>

result = re.sub("\w+", my_replace, input)

but I couldn't make it work. Any ideas?
This is just pseudocode. You need to replace it with real Python code appropriate for your problem. Exactly what code did you try?
 
  • #15
2,142
183
This is just pseudocode. You need to replace it with real Python code appropriate for your problem.
Yes indeed. I just don't know how to use the re module. It says it can take a function, but I am not sure how to use that function.
Exactly what code did you try?
Not worth sharing since it's not useful.
 
  • #16
FactChecker
Science Advisor
Gold Member
6,576
2,644
I'm starting to suspect that this is a Python homework problem because it seems very artificial. In that case, I will only give hints on how to modify your Python code.

In case it is not a Python homework problem, below is some Perl code that will work. Put the original text in the file temp2.txt and it will print the modified result to STDOUT.

Perl:
$string = `type temp2.txt`;
$string =~ s/(C(\w)C)/uc($2)/ge;
print "$string\n";
 
  • #17
jack action
Science Advisor
Insights Author
Gold Member
2,234
3,944
This is what you want:

Python:
import re

def to_camel_case(match):
    if match.group(1) is not None:
        return match.group(1).upper()

old_str = 'CıC CkCnow CwCho CyCou CaCre but CwChat CaCm CıC'
new_str = re.sub(r"C([^C])C", to_camel_case, old_str)

print(new_str)

I will leave it as an exercise for you to understand how it works.

But this is a much more useful (and fun!) use of regular expressions and python (note that there are no 'C' in the original string):

Python:
import re

def to_camel_case(match):
    if match.group(2) is not None:
        if match.group(2) not in ['but', 'and', 'of']:
            return  match.group(1) + match.group(3).upper() + match.group(4)
        else:
            return match.group(1) + match.group(2)

old_str = 'ı know who you are but what am ı'
new_str = re.sub(r"(^|[\s.,;:!?()])(([^\s.,;:!?()])([^\s.,;:!?()]*))(?=$|[\s.,;:!?()])", to_camel_case, old_str)

print(new_str)

If I were more fluent in Python, I could make a better regular expression than that, but re seems to use a limited version. The best tool to learn about regular expressions is regex101.com.
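One possible simplification, offered only as a sketch: let `\w+` pick out the words and keep the exception list in a set. The names `SMALL_WORDS` and `capitalize_word` are invented here. Unlike the longer pattern above, `\w+` splits words at apostrophes and hyphens, so the two are not equivalent on every input.

```python
import re

# Words that should stay lowercase (same list as in the example above).
SMALL_WORDS = {"but", "and", "of"}

def capitalize_word(match):
    word = match.group(0)
    if word in SMALL_WORDS:
        return word
    # Uppercase only the first character and keep the rest unchanged.
    return word[0].upper() + word[1:]

old_str = "ı know who you are but what am ı"
new_str = re.sub(r"\w+", capitalize_word, old_str)
print(new_str)  # I Know Who You Are but What Am I
```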
 
  • #18
2,142
183
I'm starting to suspect that this is a Python homework problem because it seems very artificial.
Well, I am going to use it somewhere, but it's not homework.
In case it is not a Python homework problem, below is some Perl code that will work. Put the original text in the file temp2.txt and it will print the modified result to STDOUT.
I did not ask for Perl code. I don't know Perl or how to run it.
 
  • #19
2,142
183
This is what you want:

Python:
import re

def to_camel_case(match):
    if match.group(1) is not None:
        return match.group(1).upper()

old_str = 'CıC CkCnow CwCho CyCou CaCre but CwChat CaCm CıC'
new_str = re.sub(r"C([^C])C", to_camel_case, old_str)

print(new_str)

I will leave it as an exercise for you to understand how it works.

But this is a much more useful (and fun!) use of regular expressions and python (note that there are no 'C' in the original string):

Python:
import re

def to_camel_case(match):
    if match.group(2) is not None:
        if match.group(2) not in ['but', 'and', 'of']:
            return  match.group(1) + match.group(3).upper() + match.group(4)
        else:
            return match.group(1) + match.group(2)

old_str = 'ı know who you are but what am ı'
new_str = re.sub(r"(^|[\s.,;:!?()])(([^\s.,;:!?()])([^\s.,;:!?()]*))(?=$|[\s.,;:!?()])", to_camel_case, old_str)

print(new_str)

If I was more fluent in python, I could make a better regular expression than that, but re seems to use a limited version. The best tool to learn about regular expression is regex101.com.
It's nice, but it does not work for this case:

xstr = 'CaCarmanCpopC'

It should have produced

xstr = 'AarmanPOP'

but instead it produces

AarmanCpopC

so it ignores other C values.
 
  • #20
PeterDonis
Mentor
Insights Author
2020 Award
35,623
13,803
My algorithm was to take the index of each `C` letter. Pair them as 2. Get index values between them and then turn strings into uppercase letters based on these index.
That's basically what the regex version is doing.

However, it is taking too long...
That's because Python is doing it in bytecode instructions, whereas the regex version is using the underlying C implementation for regular expressions, which will be a lot faster. But the algorithm itself is basically the same either way. There's no magical shortcut to finding the "C"s and uppercasing the letters between them.
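For comparison, here is a rough sketch of the index-pairing approach next to the regex version. The original poster's code was never shown, so `upper_manual` is a reconstruction from the description, not the actual code. The timing loop prints whatever the local machine measures; the relative numbers depend on the input and the interpreter.

```python
import re
import timeit

def upper_manual(text, marker="C"):
    """Index-pairing approach: find every marker, pair them up, uppercase between."""
    positions = [i for i, ch in enumerate(text) if ch == marker]
    pieces = []
    cursor = 0
    # zip pairs the 1st marker with the 2nd, the 3rd with the 4th, and so on;
    # an unpaired trailing marker is left untouched.
    for start, end in zip(positions[0::2], positions[1::2]):
        pieces.append(text[cursor:start])            # text before the opening marker
        pieces.append(text[start + 1:end].upper())   # text between the two markers
        cursor = end + 1                             # skip the closing marker
    pieces.append(text[cursor:])                     # text after the last pair
    return "".join(pieces)

def upper_regex(text):
    return re.sub(r"C([^C]+)C", lambda m: m.group(1).upper(), text)

sample = "CaCarmanCpopC"
print(upper_manual(sample))  # AarmanPOP
print(upper_regex(sample))   # AarmanPOP

# Time both versions on a longer input built by repeating the sample.
for func in (upper_manual, upper_regex):
    seconds = timeit.timeit(lambda: func(sample * 100), number=200)
    print(f"{func.__name__}: {seconds:.4f} s")
```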
 
  • #21
PeterDonis
Mentor
Insights Author
2020 Award
35,623
13,803
I think there are ways to pre-compile Python so that it will be faster.
If you're running Python bytecode, you're running Python bytecode. "Pre-compiling", for Python, just means compiling Python source code to bytecode in advance. That won't make much difference compared to the overhead of bytecode while actually running the algorithm.

There is the option of trying other interpreters, such as PyPy, that use various tricks to optimize how Python bytecode is run. For this problem, with a short string, that probably won't do much; but for a very large body of text, it might since the PyPy optimizer will have more opportunities to optimize.
 
  • #22
2,142
183
That's basically what the regex version is doing.
But my code takes more than 10-20 lines, while the regex version might take 5 lines, maybe less.

That's because Python is doing it in bytecode instructions, whereas the regex version is using the underlying C implementation for regular expressions, which will be a lot faster. But the algorithm itself is basically the same either way. There's no magical shortcut to finding the "C"s and uppercasing the letters between them.
I did not mean in terms of speed of the running time but in terms of me writing the code :)

Code:
re.sub(r'C(\w)C', lambda s: s.group(1).upper(), xstr)
This seems to be working, but it fails if there is more than one character between the C's.

For

xstr = 'CaCCvaC'

the above code produces

a = 'ACvaC'

but it should produce AVA.
 
  • #23
FactChecker
Science Advisor
Gold Member
6,576
2,644
It's nice, but it does not work for this case:

xstr = 'CaCarmanCpopC'

It should have produced

xstr = 'AarmanPOP'

but instead it produces

AarmanCpopC

so it ignores other C values.
Your example didn't have anything with multiple letters between the 'C's.
In line 8 try new_str = re.sub(r"C([^C]+)C", to_camel_case, old_str)
or new_str = re.sub(r"C(\w+)C", to_camel_case, old_str)
Unfortunately, this will be fooled by any pair of 'C's that are part of real words. So it is most useful if there are no capital letters in the real text.
You may need to get familiar with Python regular expressions and try some things to get it to work the way you want it to.
 
  • #24
2,142
183
Your example didn't have anything with multiple letters between the 'C's.
In line 8 try new_str = re.sub(r"C([^C]+)C", to_camel_case, old_str)
or new_str = re.sub(r"C(\w+)C", to_camel_case, old_str)
Unfortunately, this will be fooled by any pair of 'C's that are part of real words. So it is most useful if there are no capital letters in the real text.
You may need to get familiar with Python regular expressions and try some things to get it to work the way you want it to.
Guys please. As I have said earlier. In the text I am working on there are no capital letters. So there will be no uppercase C.
All strings that I will work are lowercase.

Code:
re.sub(r'C(\w+?)C', lambda s: s.group(1).upper(), xstr)

This code works
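A short sketch of why the `?` matters here, using the test string from earlier in the thread. Because `\w` also matches the letter 'C', a greedy `\w+` runs through the inner markers and only stops at the last 'C'. The lazy `\w+?` and the negated class `[^C]+` both stop at the nearest closing 'C'. The variable names are illustrative.

```python
import re

def upper(match):
    return match.group(1).upper()

xstr = "CaCarmanCpopC"

greedy = re.sub(r"C(\w+)C", upper, xstr)     # \w+ grabs as much as possible
lazy = re.sub(r"C(\w+?)C", upper, xstr)      # \w+? grabs as little as possible
negated = re.sub(r"C([^C]+)C", upper, xstr)  # [^C]+ cannot cross a marker

print(greedy)   # ACARMANCPOP
print(lazy)     # AarmanPOP
print(negated)  # AarmanPOP
```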

have anything with multiple letters between the 'C's.
CpopC was the case


You guys are really helpful, but sometimes I just need some specific things. I know what I am doing. I know the difference between capital C and lowercase C and how the code can mix them up. Maybe I am 'new' to coding, but I know that much.

But my code takes more than 10-20 lines, while the regex version might take 5 lines, maybe less.
We have seen that it takes only 1 line :)
 
  • #25
PeterDonis
Mentor
Insights Author
2020 Award
35,623
13,803
my code takes more than 10-20 lines, while the regex version might take 5 lines, maybe less
Yes, because the regex version already has built-in functions that perform the operations you need, so you don't have to code them by hand.

I did not mean in terms of speed of the running time but in terms of me writing the code :)
Yes, I agree that's important. I've found that a great source of innovation in coding is programmer laziness. :wink:
 
