Dismiss Notice
Join Physics Forums Today!
The friendliest, high quality science and math community on the planet! Everyone who loves science is here!

C/++/# Algorithm to find the smallest substring containing a set of...

Tags:
  1. Aug 13, 2016 #1
    characters in a specified amount. I've been working on this for 10+ hours and can't seem to get it right. It passes all the unit tests I've written, but when I use it in the context of the program I'm writing it fails.

    The idea is like

    ACTAGACC
    { A : 2, C : 1}
    ------> 4, because smallest substring containing 2 As and 1 C is ACTA.

    For some reason, I'm finding this very tricky to write an efficient an elegant solution. The brute force approach (Find all substrings and see which is the smallest that contains the characters) doesn't meet the performance benchmarks that I have.

    My code:

    Code (Text):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    public class MethodTest
    {
        public Dictionary<char,int> Counter { get; set; }
        public string Source { get; set; }
        public int ExpectedResult { get; set; }
    }
                       
    public class Program
    {
        public static void UnitTests()
        {
            var tests = new List<MethodTest>() {
                new MethodTest() { Source = "ABB", Counter = new Dictionary<char,int>() { { 'B', 1 } }, ExpectedResult = 1 },
                new MethodTest() { Source = "ABB", Counter = new Dictionary<char,int>() { { 'A', 1 } }, ExpectedResult = 1 },
                new MethodTest() { Source = "A", Counter = new Dictionary<char,int>() { { 'R', 1 } }, ExpectedResult = -1 },
                new MethodTest() { Source = "ATTC", Counter = new Dictionary<char,int>() { { 'T', 1 } }, ExpectedResult = 1 },
                new MethodTest() { Source = "AAAA", Counter = new Dictionary<char,int>() { { 'A', 3 } }, ExpectedResult = 3 },
                new MethodTest() { Source = "AAATTTGC", Counter = new Dictionary<char,int>() { { 'A', 1 }, { 'T', 1 } }, ExpectedResult = 2 },
                new MethodTest() { Source =  "GAAATAAA", Counter = new Dictionary<char,int>() { {'A', 4 } }, ExpectedResult = 5 }
            };
           
            foreach(var test in tests)
            {
                int result = SmallestSubstringContainingChars(test.Source,test.Counter);
                if(result != test.ExpectedResult)
                {
                    Console.WriteLine("Failed test: Source = {0}, result = {1}, expected = {2}", test.Source, result.ToString(), test.ExpectedResult.ToString());
                    return;
                }      
            }
            Console.WriteLine("All succeeded!");
        }
       
        static int SmallestSubstringContainingChars(string source, Dictionary<char, int> counter)
        {        
            int start = 0;
            while(start < source.Length && !counter.ContainsKey(source[start]))
                ++start;
            if(start == source.Length)
                return -1;
           
            // If here, start is the index of the first character of source that is in the counter.  
            // Create a pointer end that will increment from start. Keep track of the minimum difference
            // between the pointers whenever the pointers have spanned a substring that contains all the
            // characters in the counter in their specified coun.
           
           int? minspan = null;      
           for(int end = start; end < source.Length; ++end)
           {
                char c = source[end]; // character at end
             
                if(counter.ContainsKey(c)) // ch is in counter
                {
                    // Decrement count of character
                    counter[c] -= 1;
                   
                    // If the count is less than 0, meaning we found a surplus of those
                    // characters in the current range, and if that character was the
                    // one at the start, we can move the start forward to the next character
                    // in the string and in the counter.
                    if(counter[c] < 0 && source[start] == c)
                    {
                        while(!counter.ContainsKey(source[++start]));
                        counter[c] += 1;
                    }
                   
                }
               
              // If we've found all the characters in counter, then from start to end
              // we have a substring that fits our criteria. Updated minspan.
              if(counter.Values.All(count => count <= 0))
              {
                  int span = end - start;
                  minspan = minspan == null ? span : Math.Min((int)minspan, span);
              }
            }
           
            // If didn't find any range that spans all the characters in their specified count, return -1;
            // else, return the length of the spanned substring.
            return minspan == null ?  -1 : ((int)minspan + 1);
        }
       
        public static void Main()
        {
            UnitTests();
        }
    }
    [CODE]
     
  2. jcsd
  3. Aug 13, 2016 #2

    chiro

    User Avatar
    Science Advisor

    Hey SlurrerOfSpeech.

    If you want to make this more efficient then you will have to re-structure your data to "facilitate" the better computational complexity.

    To give you an idea of what I mean I will use the examples of binary trees for sorting and searching.

    A binary tree classifies everything to the left and right of something and by continually doing that, it allows you to order things without having to do the brute force search.

    You will need to do the same thing for your data structure (that is - a "string") and for the functions you use.

    You are looking for the smallest sub-string that contains two particular characters which hints that data structures are to be created to link to the characters themselves and relations to other neighboring characters.

    This hints at specifying something where you can project out these characteristics - like a vector where a projection returns the information on character frequency with another projection getting the relation of distance between characters.

    I don't know if you know mathematics, but what I am suggesting is that you find a way to represent frequency and distance of characters (i.e. distance between different realizations of "letters") and use that to solve an optimization problem where you are finding whether a string meets the frequency requirements and then if so how you can find out the sub-string that minimizes all possible distances.
     
  4. Aug 13, 2016 #3
Know someone interested in this topic? Share this thread via Reddit, Google+, Twitter, or Facebook

Have something to add?
Draft saved Draft deleted



Similar Discussions: Algorithm to find the smallest substring containing a set of...
Loading...