Searching for a string in a file (C)

In summary, I am trying to add strings to a file, checking whether each one is already there before I write it.
  • #1
Epic Sandwich
Hi,

I've got a program which recalls strings at random from a file. The strings are stored end to end with no line breaks; each one takes up 200 bytes, and if a string is shorter than that, spaces are inserted after it to make up the length.

The program works out the length of the file, and from that the number of strings stored in it ( length / 200 ). Then it picks a random number and prints the string associated with that number.
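For reference, the recall side is roughly this (a simplified sketch rather than my exact code; kMaxFactLength and kFactsFilePath stand in for the constants used in the function below):

Code:
#include <stdio.h>
#include <stdlib.h>

#define kMaxFactLength	200			//Record size
#define kFactsFilePath	"facts.txt"	//Stand-in path

void	PrintRandomFact( void )
{
	FILE *	file;
	long	fileLength;
	char	fact[ kMaxFactLength + 1 ];
	
	if( ( file = fopen( kFactsFilePath, "r" ) ) == NULL )
		return;
	
	fseek( file, 0L, SEEK_END );						//Length of file. . 
	fileLength = ftell( file );							//. . gives the number of records
	
	long	count = fileLength / kMaxFactLength;
	if( count == 0 )									//Empty file, nothing to print
	{
		fclose( file );
		return;
	}
	
	long	pick = rand() % count;						//Random record index (seed rand with srand elsewhere)
	
	fseek( file, pick * kMaxFactLength, SEEK_SET );		//Jump straight to that record
	if( fread( fact, kMaxFactLength, 1, file ) == 1 )
	{
		fact[ kMaxFactLength ] = '\0';					//Terminate before printing
		printf( "%s\n", fact );
	}
	fclose( file );
}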

That section of the program all works. However, I am now writing another program that adds strings to the file, padding each one with spaces to make it 200 bytes long (since doing that by hand would be tedious). That part works fine, but I am also trying to implement a function that checks whether a string is already in the file before it is written, to stop duplicates from appearing.

This is the function:

Code:
bool	CheckString( char * string )
{
	FILE *	file;		//Define file pointer
	long	fileLength;
	char *	currentFacts;
	char	tempString[ kMaxFactLength ];
	
	strcpy( tempString, string );
	
	tempString[ sizeof( tempString ) - 1 ] = ' ';
	
	if( ( file = fopen( kFactsFilePath, "r" ) ) == NULL )	//Open file
		return true;
	
	if( fseek( file, 0L, SEEK_END ) != 0 )					//Set pointer to end. . 
		Error( "fseek" );
	
	if( ( fileLength = ftell( file ) ) == -1L )				//. . and output position to work out size of file
		Error( "ftell ");
	
	if( ( currentFacts = malloc( sizeof( char ) * fileLength ) ) == NULL )	//Give currentFacts enough space to hold everything
		Error( "malloc" );
	
	if( fseek( file, 0L, SEEK_SET ) != 0 )					//Set pointer back to the start
		Error( "fseek" );
	
	if( fgets( currentFacts, fileLength, file ) == NULL )	//Read all the facts into currentFacts
		Error( "fgets" );
	
	if( strstr( currentFacts, tempString ) == NULL )
	{
		free( currentFacts );
		fclose( file );
		return false;
	}
	
	free( currentFacts );
	fclose( file );
	return true;
}

You pass it a string; it opens the file, searches, and returns true if the string already exists in the file (and false if it doesn't). I have tried multiple ways of doing this, but this is as far as I've got.

I open the entire file and store it in a string, then use strstr to work out if the entered string exists in the entire file.

I'm pretty sure I'm doing this wrong, but all my attempts so far have failed.

Can anyone help?

Thanks :)
 
  • #2
Wow, that is just about the most inefficient implementation I can think of...good work!
And isn't virtual memory wonderful? (malloc()ing the entire file size could grind your memory to a halt, but hey it probably works for most files).

I would place a small bet that you do not have a '\0' terminator on one or both of your search strings, or that your file contains many such false terminators. All C string functions expect their arguments to be terminated with a zero byte; otherwise they just happily keep reading until they wander off the end of the buffer.
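To illustrate (a sketch, not your code): anything you fill with fread() has to be terminated by hand before strstr() and friends will behave:

Code:
#include <stdio.h>

//Read one 200-byte record and terminate it ourselves. fread() does not
//append a '\0', so without the last line every str* call runs off the end.
int	ReadRecord( FILE * file, char * out )	//out must hold at least 201 bytes
{
	if( fread( out, 200, 1, file ) != 1 )
		return 0;							//Short read or EOF
	out[ 200 ] = '\0';						//The terminator the str* functions need
	return 1;
}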
 
  • #3
I think the way I'd approach the problem is (in pseudocode, because it's been too long since I wrote anything in C):

Code:
if(string's length > 200) { shorten to 200, or return an error }
else if(string's length < 200) { pad the string out to 200 characters }

open the FILE

while(!EOF) {
  read in 200 bytes
  if(length of read in chunk != 200) { pad out to 200 characters, or return an error }
  if(string == read in chunk) {
    close FILE
    return success!
  }
}
return failure!

No need to keep the whole thing in memory-- you only need 200 bytes at a time to compare. Plus, you don't have to keep reading in the file after you find the string.
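In actual C, that loop would come out something like this (untested, borrowing the kMaxFactLength and kFactsFilePath constants from your code, plus the obvious <string.h> and <stdbool.h> includes):

Code:
bool	CheckString( const char * string )
{
	FILE *	file;
	char	padded[ kMaxFactLength + 1 ];
	char	chunk[ kMaxFactLength + 1 ];
	size_t	len = strlen( string );
	
	if( len > kMaxFactLength )							//Shorten to 200. . 
		len = kMaxFactLength;
	
	memcpy( padded, string, len );
	memset( padded + len, ' ', kMaxFactLength - len );	//. . or pad out with spaces
	padded[ kMaxFactLength ] = '\0';
	
	if( ( file = fopen( kFactsFilePath, "r" ) ) == NULL )
		return false;									//No file means no duplicates
	
	while( fread( chunk, kMaxFactLength, 1, file ) == 1 )	//Read in 200 bytes at a time
	{
		chunk[ kMaxFactLength ] = '\0';
		if( strcmp( padded, chunk ) == 0 )				//string == read in chunk
		{
			fclose( file );
			return true;								//Success: it's already there
		}
	}
	
	fclose( file );
	return false;										//Not found
}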

If you arranged the strings in order or something, and were worried about retrieval time (if string compares are the limiting factor, and not file I/O or memory), then maybe you could do some fancy binary search tree to locate the record, but that's probably out of scope.

As is, I have to ask, what's this doing?
Code:
tempString[ sizeof( tempString ) - 1 ] = ' ';

[edit]What is being passed in? Are you passing in a 200-character string, or a shorter one that you intend to pad out to 200 characters? What if the 200th character is a valid character? If you're passing in a 200-byte string, is the string terminator overwriting the byte of memory after the 200th character? Also, with fgets, don't you need to specify the length of the buffer you want to read, plus 1 for the null character?[/edit]

[edit]Another issue-- if you can't open the file, it looks like it returns true rather than false?[/edit]

DaveE
 
Last edited:
  • #4
It is a bit inefficient, isn't it. I did try some other solutions, and to be honest this was somewhat of a last resort; I think I would have tried to improve its efficiency if I had actually got it working. I have a feeling that it is something to do with the terminating 0 (or maybe in this case, the lack of one), so I'll mess around a bit with that in mind.

To Dave, thanks for that, I'll have a go at converting that to C sometime. Hopefully it'll be more successful than my previous attempts!

For your questions, I think the first one was to see if there was a newline printed in the string, and if there was, to remove it. I'm not sure if it was actually needed, but I left it in there. Chances are, it would probably delete the last character of a 200-character string, but to be honest I am still trying to get my head around strings, string functions, terminating 0s etc.

And yeh, that should be false. I changed true and false around at the bottom (originally it returned true if a match wasn't found, which is a bit illogical) but I must have forgotten to change that one. Thanks :)

Any more help is gladly appreciated! I shall have a go at getting it to work with all the help so far tomorrow.

EDIT:

Just found a way that works. Instead of reading in the string before it had been padded out, I read it in afterwards through the global variable I had set. I think my main problem was mixing up terminating characters etc. Here's the code I got to work:

Code:
bool	CheckString( void )
{
	FILE	*	file;											//Establish file pointer
	char	stringToCompare[ kMaxFactLength + 1 ];				//Define char to hold imported string
	long	offset;												//Set the offset the file needs to read
	
	if( ( file = fopen( kFactsFilePath, "r" ) ) == NULL )		//Open file, if it doesn't exist, entry can't be a dupe!
		return false;
	
	for( int i = 0; fgetc( file ) != EOF; i++ )				//fgetc reads one byte just to detect EOF; the fseek below fixes the position
	{
		offset = i * kMaxFactLength;							//Set seek offset
		
		if( fseek( file, offset, SEEK_SET ) != 0 )				//Sets correct offset
			Error( "fseek" );
		
		if( fread( stringToCompare, (size_t)kMaxFactLength, (size_t)1, file ) != 1 )	//Read in fact
			Error( "fread" );			
		
		stringToCompare[ kMaxFactLength ] = '\0';				//Set terminating character
		
		if( strcmp( gFact, stringToCompare ) == 0 )				//Compare the strings
		{
			fclose( file );
			return true;
		}
	}
	
	fclose( file );
	return false;
}

That seem alright? Thanks for everyone's help :D
 
Last edited:
  • #5
Not that it's necessarily what you want to do, but this is one of those problems that would be considerably easier to implement in a scripting language like Perl.
 
  • #6
I actually don't see a problem with loading the complete file. It is fast and easy. I bet if you did timing tests you would find it is orders of magnitude faster than block reads.

In the old days we had to play games with memory and buffering; that is not an issue anymore. That is the point of 4 GB of virtual memory, so why fight it? I doubt you will hit a size issue; if it is a concern, then test for it and do buffered searches.

I remember a time when I tested for the file size: if it was less than 64K I would load it, otherwise I would buffer it.

I used to spend a lot of time benchmarking which buffer sizes would optimize operations like this, and it is not pretty. But I did it because it mattered.

I am all for smart code, but part of it is fast code. Yeah I am usually in favor of conserving memory over code speed, but in the case of file IO, the larger the data block the better. ( at one time I had it figured that 2k was the perfect buffer size, but that was a lifetime ago in technology terms ).
 
  • #7
I think 2-4 KB was the memory page and disk block size on most UNIX systems. Don't know what the standard is these days, though.

One would still need to be cautious when reading entire files into memory now that a simple "Hello World" text file can bloat into multiple kilobytes of useless formatting commands... But for most tasks virtual memory is a godsend.
 
  • #8
schip,
I agree, you have to know that it is not "the" way to do it all the time, and you should understand the limits of the OS you are coding on.
BTW, my favorite programming is embedded. Just man and machine. I think everyone would be a better programmer if they started with embedded. My last embedded project was for a client who added flash memory to their device and wanted a remote code upgrade capability. So I wrote the flash upgrade mods. Uggh. It was done through some arcane serial protocol and had to have error checking, compression and security.
 
  • #9
airborne18 said:
2k was the perfect buffer size
There was probably no limit for IDE hard drives using port I/O to transfer data, since virtual addresses were used. When DMA mode was used, then maximum size of a single transfer depended on the number of descriptors in the I/O card. Each descriptor contains a physical address and length (the memory had to be "locked" into place before the I/O started), and the I/O card just goes through the list of descriptors during a transfer.

On Intel systems using 4k virtual memory page sizes, and with some old SCSI cards only having 17 descriptors, maximum transfer was limited to 64K, since the first and last descriptors would start and end in the middle of virtual 4k pages. Other SCSI cards had 255 (why not 257?), and max transfer size was 1MB - 8k. I never found a limit for IDE or SATA controllers, I tested up to 2MB without issue. Perhaps they reprogram a set of descriptors during a transfer, since there is no maximum time between data transfers.

Getting back to the OP: assuming your PC has 1 GB or more of RAM, you might as well read in the entire file. If you want to speed things up, use memchr(...), which searches for the next instance of the first character of the string, then do a memcmp(...) to compare the strings. On an Intel CPU there are single instructions that perform these operations (memchr = repne scasb, memcmp = repe cmpsb).
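Something along these lines (a sketch; it assumes the whole file has already been read into buf, and that the target is a full 200-byte record):

Code:
#include <string.h>

//Scan an in-memory buffer for a record: memchr finds each candidate
//first byte, memcmp confirms the full match.
int	FindInBuffer( const char * buf, long bufLen, const char * target, long targetLen )
{
	const char *	p = buf;
	const char *	end = buf + bufLen - targetLen;		//Last possible start
	
	while( p <= end )
	{
		p = memchr( p, target[ 0 ], end - p + 1 );		//Next candidate first byte
		if( p == NULL )
			return 0;									//First byte never occurs again
		if( memcmp( p, target, targetLen ) == 0 )
			return 1;									//Full match
		p++;											//Step past this candidate
	}
	return 0;
}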
 
Last edited:
  • #10
airborne18 said:
schip,
BTW, my favorite programming is embedded. Just man and machine. I think everyone would be a better programmer if they started with embedded...

Man, Machine, and a bunch of questionable Debugging hacks... Since I started out in hardware, my transition to software went through "firmware" where I was the guy that could get around the silly decisions the hardware guys had designed into their boards... It does help with understanding what is going on way underneath, but it seems only a few folks need to do that now. When I left the "profession" I was having difficulty explaining my specific skill sets, even to other technical folks. My last real job was so deep in a database server engine that I didn't even need to know SQL...
 
  • #11
schip666! said:
Man, Machine, and a bunch of questionable Debugging hacks... Since I started out in hardware, my transition to software went through "firmware" where I was the guy that could get around the silly decisions the hardware guys had designed into their boards... It does help with understanding what is going on way underneath, but it seems only a few folks need to do that now. When I left the "profession" I was having difficulty explaining my specific skill sets, even to other technical folks. My last real job was so deep in a database server engine that I didn't even need to know SQL...

I spent more of my time consulting than at regular employment. Once you get your first 3 years of experience, move on; get another full-time job to get different experience. After about 6 years I started doing contract positions to get experience with as many projects as I could.

I always looked at it this way: the less the average programmer had to worry about the low-level details, the more high-end work there was for those who could deal with them. Even with the internet, there is still good work on the server side.

I made a lot of money being the person that could do the hardcore low level development.

Now my focus is on security. Network security specifically. Take wireless networks. It is not impossible to have a secure wireless network, but it is costly. For the average small business and even home users, there is no such thing as a secure wireless network.

I have a ton of little projects that take up my time, mostly with game theory and finance. But a few of them relate to optimizing network traffic for security. Not sure when I will ever get to it.
 
  • #13
That is cool. You win, you have cooler toys than me.

My application is a bit more complex... it is to optimize investment risk for total return.

Until I fix my mathematica system I am in a holding pattern.
 
  • #14
Optimizing investment risk would be a pretty cool toy. Presuming that it can be done.
My personal opinion is that the information necessary to do it is not available, at least in the general case. Kinda like "Searching for a(ny) string in a file". Even Bayesian Inference is stochastic without input data.

Maybe we should start a new thread? Is there an econ heading? I probably wouldn't read it anyway...
 
  • #15
Lol. I actually had this conversation, sort of, in an econ thread.

Well, I am not using a stochastic model. Everyone does that, and it is based on a few flaws (first and foremost, that volatility and other real market metrics are treated as static in the model). And of course you have to buy into Random Walk, which I do not.

Which is in part where game theory comes into play, and graph theory.

Consider this analogy. Call it the Jailbird's dilemma. You are playing Monopoly. Early in the game you want to buy property, so if you end up in jail, you will do anything to get out. However, as the game progresses, your incentive to get out of jail diminishes. It will of course depend on the rules you play by, like whether you can collect rent and build while in jail. So the motivation to stay in or leave Monopoly jail is not simply your state, but the state of the game as a whole.

Yes, probability enters the decision, like whether you can get wiped out on the stretch of board that follows jail, but you have to make a choice, and each choice carries its own risk and probability.
 
  • #16
Also, this is not fully cooked yet. I am so busy with my coursework, and getting the kid back to college this week.
 
  • #17
Changing dynamics are not usually considered in basic Econ, nor is limited information. Adam Smith works great on paper, where you can see all the players and know the numbers. More advanced models do consider these things. I don't dabble in it, but you might find W. Brian Arthur and Martin Shubik interesting for game-theory approaches.

Applying statistical mechanics to market dynamics is also a burgeoning field. Look for papers by Doyne Farmer and such. They are finding power-law relationships between various statistics, which indicates that there may be non-random-walk drivers in the system. However, given the magnitude of the problem, probabilistic methods may be the only tractable approach.

I stand by my assertion that the necessary data is not available. This is borne out by our ongoing "economic crisis", where even the pros couldn't evaluate the risk in their mortgage portfolios because the fundamentals were not published by the providers. Plus they were greedy. That might be the best place to apply game theory, but my general impression is that it is only as good as your estimate of your opponents' stupidity.
 
  • #18
I appreciate your insight, sincerely.

I follow Eugene Fama; his research is very interesting. And yes, I agree with all of your points.

But keep in mind that economics is a sport now, and every economist you see on TV has a bias. However, there are metrics which are undeniable, ones that are not smoothed through the economic modeling process.

Take unemployment: those calculations are worthless. What really matters is the fact that 17-18 million people are out of work. Forget that they are not filing, or not looking for work anymore; they are out of work.

That metric has not changed in 18 months. However, this year was a census year, and the government was going to hire millions of people. So was it really a hard projection for the economists that unemployment would drop over the winter? Nope.

Another metric is the interest rate (there is a whole thread on this, and here is the why). Interest rates are a key variable in all valuation models; their use is dictated by GAAP, and they are a relevant metric. Low interest rates create an environment where businesses and traders must assume more risk to create a return (and the lower rates justify the higher risk). They also force asset valuations to change: a move up in rates lowers the value of assets, and a move down increases it.

Another metric is the ratio of company revenue growth in relation to GDP and inflation. Currently, company net earnings have grown through the recession relative to all indicators. That is an issue, because the growth is through aggressive cost cutting and not top-line revenue growth. That is contrary to sound business: with today's cheap rates and depressed equity it would make more sense to expand, but they are not.

There are discrete data points that can indicate inefficiencies in the market (more on this later; it is a key point).

One issue with professional portfolios, such as mutual funds, is that they are limited by their charter. A fund manager has rules imposed on him, and the inflow of monthly contributions adds risk to the portfolio. The limitations of the fund's charter add risk too. That is where modern portfolio theory fails investors who buy into these funds.

Most funds have to be fully invested; there are very few "trade any way you think you can make money" funds. The fund business has strayed from balanced funds to niche target ETFs that are not investments but risk-arbitrage tools.

A minor point about statistical inference and probability distributions: yes, they work out over the long term, simply by the law of large numbers. But I do not accept the notion that you just keep including more market history in your pricing until you get a smooth distribution.

You have to believe this: most moves in the market occur over very short periods of time, and the variance of volatility no longer fits a distribution. If we were talking pure continuous curves and limits, the limit of any pricing curve would be undefined when the market is most volatile. So as risk and volatility increase, the market as a whole becomes inefficient.

I could go on for hours when I am heavily medicated. But this is a start.
 
  • #19
Doesn't "inefficient market" mean random?
 
  • #20
Not in terms of equity markets; it actually occurs when the market moves do not fit a probability curve. A simple test of this: take 20 days of pricing for any major equity, do the distribution, and look at the bin widths and the counts. Yeah, eventually, if you keep adjusting your data set to include a larger sample, the law of large numbers will smooth it out.

But if you stay under 200 days, which is a trading year, you will see bias.

When the equity markets become inefficient, there is a bias that is not random.

That is my issue. Every model uses discrete variables, so why make volatility one of them? It makes no sense, because volatility is the one value that reflects the current market state in relation to everything else.
 

1. What is a string in the context of file searching in C?

A string in C is a null-terminated sequence of characters used to represent text. In the context of file searching, it is the text data you are looking for (or looking through); it can include letters, numbers, and special characters.

2. How do I search for a specific string in a file using C?

To search for a string in a file using C, you can use the strstr() function, which is part of the standard C library. It takes two parameters - the string you want to search in and the string you want to search for - and returns a pointer to the first occurrence of the latter within the former. Note that strstr() works on null-terminated strings in memory, so you first read the file's contents into a terminated buffer and search that.
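For example (a minimal sketch; since strstr() works on memory, the file is read into a buffer first):

Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

//Minimal sketch: load a whole text file into memory, terminate it,
//then let strstr() do the searching.
int	FileContainsString( const char * path, const char * needle )
{
	FILE *	file = fopen( path, "r" );
	if( file == NULL )
		return 0;
	
	fseek( file, 0L, SEEK_END );
	long	length = ftell( file );
	fseek( file, 0L, SEEK_SET );
	
	char *	buffer = malloc( length + 1 );		//+1 for the terminator
	if( buffer == NULL )
	{
		fclose( file );
		return 0;
	}
	
	size_t	got = fread( buffer, 1, length, file );
	buffer[ got ] = '\0';						//strstr needs this
	
	int	found = strstr( buffer, needle ) != NULL;
	
	free( buffer );
	fclose( file );
	return found;
}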

3. Can I search for a string in a file using a wildcard or regular expression?

No. strstr() performs a literal substring match only; wildcard characters and regular expressions in the search string are matched as ordinary characters, not interpreted. For pattern matching you need a different facility, such as the POSIX regcomp()/regexec() functions or fnmatch().
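For instance, on POSIX systems a pattern search over a buffer might look like this (a sketch using regcomp()/regexec(); these are separate functions, not part of strstr()):

Code:
#include <regex.h>

//Sketch: true pattern matching with POSIX regcomp()/regexec(),
//which strstr() cannot do -- strstr() matches literal substrings only.
int	BufferMatchesPattern( const char * buffer, const char * pattern )
{
	regex_t	re;
	int		rc;
	
	if( regcomp( &re, pattern, REG_EXTENDED | REG_NOSUB ) != 0 )
		return 0;								//Pattern failed to compile
	
	rc = regexec( &re, buffer, 0, NULL, 0 );	//0 means a match was found
	regfree( &re );
	return rc == 0;
}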

4. What happens if the string I am searching for is not found in the file?

If the string you are searching for is not found, the strstr() function will return a null pointer. It is important to check for this in your C program before using the result, to avoid any errors.

5. Are there any other functions I can use to search for a string in a file besides strstr()?

Yes. Functions such as fgets() and fread() can be used to read the file in chunks, which you then compare with strcmp() or memcmp(). Reading fixed-size records and comparing them directly can be more suitable than loading the whole file, so it is worth understanding the differences before choosing an approach.
