Getting a string from a file without punctuation for spell checking, then outputting it with the original punctuation

1

Hi, I'm making a spell checker in C that keeps its dictionary in an array of strings and uses binary search to look words up.

My problem is that I am trying to read text from a file and write it back to a new file with misspelled words highlighted like this: ** spellingmistake **. The file will include characters such as .,!? which should be written to the new file but obviously must not be present when the word is compared to the dictionary.

So I want this:

text file: "worng!"

new file: "** worng **!"

I've been trying to solve this as best I can and have spent quite a while on Google, but I'm not getting any closer to a solution. So far I have written the following code, which reads each character and fills two char arrays: temp, a lower-case copy for the dictionary comparison, and input, which holds the original word. This works if there is no punctuation, but when punctuation is present I lose the spaces. I'm sure there is a better way to do this but I just can't find it, so any pointers would be appreciated.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

#define MAX_STRING_SIZE 29  /*define longest non-technical word in english dictionary plus 1*/

/*function prototypes*/
int dictWordCount(FILE *ptrF);  /*counts and returns number of words in dictionary*/
void loadDictionary(char ***pArray1, FILE *ptrFile, int counter);   /*create dictionary array from file based on word count*/
void printDictionary(char **pArray2, int size); /*prints the words in the dictionary*/
int binarySearch(char **pArray3, int low, int high, char *value);   /*recursive binary search on char array*/

void main(int argc, char *argv[]){
    int i;  /*index*/
    FILE *pFile;    /*pointer to dictionary file*/
    FILE *pInFile;  /*pointer to text input file*/
    FILE *pOutFile; /*pointer to text output file*/
    char **dict;    /*pointer to array of char pointer - dictionary*/
    int count;      /*number of words in dictionary*/
    int dictElement;    /*element the word has been found at returns -1 if word not found*/

    char input[MAX_STRING_SIZE];    /*input to find in dictionary*/
    char temp[MAX_STRING_SIZE];
    char ch;    /*store each char as read - checking for punctuation or space*/
    int numChar = 0; /*number of char in input string*/

    /*************************************************************************************************/
    /*open dictionary file*/
    pFile = fopen("dictionary.txt", "r");   /*open file dictionary.txt for reading*/
    if(pFile==NULL){    /*if file can't be opened*/
        printf("ERROR: File could not be opened!\n");
        exit(EXIT_FAILURE);
    }

    count = dictWordCount(pFile);
    printf("Number of words is: %d\n", count);

    /*Load Dictionary into array*/
    loadDictionary(&dict, pFile, count);

    /*print dictionary*/
    //printDictionary(dict, count);
    /*************************************************************************************************/
    /*open input file for reading*/
    pInFile = fopen(argv[1], "r");
    if(pInFile==NULL){  /*if file can't be opened*/
        printf("ERROR: File %s could not be opened!\n", argv[1]);
        exit(EXIT_FAILURE);
    }
    /*open output file for writing*/
    pOutFile = fopen(argv[2], "w");
    if(pOutFile==NULL){ /*if file can't be opened*/
        printf("ERROR: File could not be created!\n");
        exit(EXIT_FAILURE);
    }

    do{
        ch = fgetc(pInFile);                /*read char from file*/

        if(isalpha((unsigned char)ch)){     /*if char is alphabetical char*/
            //printf("char is: %c\n", ch);
            input[numChar] = ch;            /*put char into input array*/
            temp[numChar] = tolower(ch);    /*put char in temp in lowercase for dictionary check*/
            numChar++;                      /*increment char array element counter*/
        }
        else{
            if(numChar != 0){
                input[numChar] = '\0';  /*add end of string char*/
                temp[numChar] = '\0';

                dictElement = binarySearch(dict,0,count-1,temp);    /*check if word is in dictionary*/

                if(dictElement == -1){  /*word not in dictionary*/
                    fprintf(pOutFile,"**%s**%c", input, ch);
                }
                else{   /*word is in dictionary*/
                    fprintf(pOutFile, "%s%c", input, ch);
                }
                numChar = 0;    /*reset numChar for next word*/
            }
        }
    }while(ch != EOF);

    /*******************************************************************************************/
    /*free allocated memory*/
    for(i=0;i<count;i++){
        free(dict[i]);
    }
    free(dict);

    /*close files*/
    fclose(pInFile);
    fclose(pOutFile);

}
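
The question declares binarySearch() but does not show its body. As a minimal sketch only, a recursive lookup matching that prototype might look like the following. It assumes the dictionary array is sorted in strcmp() order and holds lower-case words; the parameter name words is illustrative and not taken from the original code.

```c
#include <string.h>

/* Recursive binary search over a sorted array of strings.
   Returns the index of value, or -1 if it is not present.
   Assumes words[low..high] is sorted in strcmp() order. */
int binarySearch(char **words, int low, int high, char *value)
{
    int mid;
    int cmp;

    if (low > high) {
        return -1;                      /* empty range: word not found */
    }

    mid = low + (high - low) / 2;       /* avoids overflow of low + high */
    cmp = strcmp(value, words[mid]);

    if (cmp == 0) {
        return mid;                     /* found at this element */
    }
    if (cmp < 0) {
        return binarySearch(words, low, mid - 1, value);   /* left half */
    }
    return binarySearch(words, mid + 1, high, value);      /* right half */
}
```

Called as binarySearch(dict, 0, count-1, temp), this returns -1 for a word that is not in the array, which is the value the loop in main() tests for.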
2012-04-04 18:32
by Astabh
Not really sure why this has been voted down. I have spent 2 days on this and tried different approaches but am well and truly stuck. I have also searched for a solution on multiple websites, including this one, and the code I have now is what I have been able to do with my research. I would like to remind people that I am a student and have only been learning C for about 9 weeks, so asking for guidance from people who are more experienced is natural - Astabh 2012-04-04 19:37


1

I'm not 100% sure I've understood your problem correctly, but I'll give it a shot.

First, your loop

do{
    ch = fgetc(pInFile);
    /* do stuff */
}while(ch != EOF);

also runs once after the end of the file has been reached. So if the last byte of the file is alphabetic, one of two things happens. Either you print an undesired EOF byte to the output file, or, since you cast ch to unsigned char when passing it to isalpha() (which usually yields 255 for EOF = -1 and an 8-bit unsigned char), the value will in some locales (en_US.iso885915, for example) be considered an alphabetic character, which suppresses the last word of the input file.

To deal with this, first don't cast ch when passing it to isalpha(), and second add some logic to the loop to prevent unintentional handling of EOF. Passing ch uncast is only safe if ch is declared as int, the type fgetc() returns, rather than char. I chose to replace EOF with a newline if the need arises, since that's simple.

Then it remains to print out the non-alphabetic characters which don't immediately follow alphabetic characters:

do{
    ch = fgetc(pInFile);                /*read char from file*/

    if(isalpha(ch)){                    /*if char is alphabetical char*/
        //printf("char is: %c\n", ch);
        input[numChar] = ch;            /*put char into input array*/
        temp[numChar] = tolower(ch);    /*put char in temp in lowercase for dictionary check*/
        numChar++;                      /*increment char array element counter*/
    }
    else{
        if(numChar != 0){
            input[numChar] = '\0';  /*add end of string char*/
            temp[numChar] = '\0';

            dictElement = binarySearch(dict,0,count-1,temp);    /*check if word is in dictionary*/

            if(dictElement == -1){  /*word not in dictionary*/
                fprintf(pOutFile,"**%s**%c", input, (ch == EOF) ? '\n' : ch);
            }
            else{   /*word is in dictionary*/
                fprintf(pOutFile, "%s%c", input, (ch == EOF) ? '\n' : ch);
            }
            numChar = 0;    /*reset numChar for next word*/
        }
        else
        {
            if (ch != EOF) {
                fprintf(pOutFile, "%c",ch);
            }
        }
    }
}while(ch != EOF);
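
As an alternative to substituting a newline for EOF, the loop can test for EOF before doing anything else and flush the last pending word after the loop ends. The following is only a sketch under stated assumptions: markWords(), flushWord() and isKnownWord() are hypothetical names, and isKnownWord() is a small stand-in for the binarySearch() lookup in the question.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_WORD 29 /* same limit as MAX_STRING_SIZE in the question */

/* Hypothetical stand-in for the dictionary lookup: returns 1 if the
   lower-case word is known, 0 otherwise. */
static int isKnownWord(const char *lower)
{
    static const char *known[] = { "a", "is", "test", "this" };
    size_t i;

    for (i = 0; i < sizeof known / sizeof known[0]; i++) {
        if (strcmp(known[i], lower) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Terminate and write the buffered word, wrapped in ** ** if unknown,
   then reset the length. Does nothing if no word is buffered. */
static void flushWord(FILE *out, char *word, char *lower, size_t *len)
{
    if (*len == 0) {
        return;
    }
    word[*len] = '\0';
    lower[*len] = '\0';
    if (isKnownWord(lower)) {
        fprintf(out, "%s", word);
    } else {
        fprintf(out, "**%s**", word);
    }
    *len = 0;
}

/* Copy in to out, marking unknown words and keeping every
   non-alphabetic character exactly where it was. */
void markWords(FILE *in, FILE *out)
{
    char word[MAX_WORD];    /* word as it appears in the input */
    char lower[MAX_WORD];   /* lower-case copy for the lookup */
    size_t len = 0;
    int ch;                 /* int, so EOF differs from every byte value */

    while ((ch = fgetc(in)) != EOF) {
        if (isalpha(ch)) {
            if (len == MAX_WORD - 1) {
                /* buffer full: an over-long run of letters is split */
                flushWord(out, word, lower, &len);
            }
            word[len] = (char)ch;
            lower[len] = (char)tolower(ch);
            len++;
        } else {
            flushWord(out, word, lower, &len);
            fputc(ch, out);             /* punctuation, spaces, newlines */
        }
    }
    flushWord(out, word, lower, &len);  /* last word if input ends on a letter */
}
```

Because EOF never enters the loop body, nothing extra is written at the end of the output, and a final word with no trailing punctuation is still checked.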
2012-04-04 21:36
by Daniel Fischer
Thank you!! That was exactly what I was trying to do, but I just couldn't get my head around how. As for casting ch, that was something I got from other examples, but I guess I didn't really look into the why of it properly - Astabh 2012-04-05 12:26
In general, casting the result of fgetc or getchar isn't necessary. Functions like isalpha take an int argument that must be EOF or the value of an unsigned char, which is exactly what you get from fgetc, so a cast isn't necessary there and, if the result is EOF, may be harmful. For functions that can't handle EOF, you have to check for it before calling, and then casting can't do harm (and, if the function takes an unsigned char as its argument, may be necessary to avoid a compiler warning). Rule of thumb: don't cast unless you know it's necessary or the compiler tells you - Daniel Fischer 2012-04-05 12:47
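
The two situations described in the comment above can be put side by side in a short sketch. The function names countAlphaInStream() and countAlphaInString() are hypothetical. The string case is one where the cast is known to be necessary: the elements are plain char, which may be negative on some platforms, and isalpha() only accepts EOF or an unsigned char value.

```c
#include <ctype.h>
#include <stdio.h>

/* Stream case: fgetc() already returns EOF or an unsigned char value
   as an int, so the result goes to isalpha() without a cast. */
size_t countAlphaInStream(FILE *in)
{
    size_t n = 0;
    int ch;     /* int, not char, so EOF stays distinguishable */

    while ((ch = fgetc(in)) != EOF) {
        if (isalpha(ch)) {
            n++;
        }
    }
    return n;
}

/* String case: each element is a plain char that may be negative,
   so it is converted to unsigned char before the call. */
size_t countAlphaInString(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (isalpha((unsigned char)*s)) {
            n++;
        }
    }
    return n;
}
```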


0

It looks like right now if the char isn't alphabetical it triggers the else block for if(isalpha((unsigned char)ch)){ and the character itself gets ignored.

If you add a statement to just print all non-alphabetical characters out exactly as they come in, I think that'd accomplish what you want. This'd need to go inside that else block and after the if(numChar != 0){ block and would just be a simple fprintf statement.

2012-04-04 21:21
by TomNysetvold