Java language detection with langdetect - how to load profiles?

6

I'm trying to use a Java library called langdetect hosted here. It couldn't be easier to use:

Detector detector;
String langDetected = "";
try {
    String path = "C:/Users/myUser/Desktop/jars/langdetect/profiles";
    DetectorFactory.loadProfile(path);
    detector = DetectorFactory.create();
    detector.append(text);
    langDetected = detector.detect();
} 
catch (LangDetectException e) {
    throw e;
}

return langDetected;

The one exception is the DetectorFactory.loadProfile method. The library works great when I pass it an absolute file path, but ultimately I think I need to package my code and langdetect's companion profiles directory inside the same JAR file:

myapp.jar/
    META-INF/
    langdetect/
        profiles/
            af
            bn
            en
            ...etc.
    com/
        me/
            myorg/
                LangDetectAdaptor --> is what actually uses the code above

I will make sure that LangDetectAdaptor, which is located inside myapp.jar, is supplied with both the langdetect.jar and jsonic.jar dependencies that langdetect needs at runtime. However, I'm confused about what I need to pass to DetectorFactory.loadProfile for this to work:

  • The langdetect JAR ships with the profiles directory, but you need to initialize it from inside your own JAR. So do I copy the profiles directory into my JAR (as laid out above), or is there a way to keep it inside langdetect.jar and access it from my code?

Thanks in advance for any help here!

Edit: I think the problem here is that langdetect ships with this profiles directory, but then wants you to initialize it from inside your JAR. The API would probably benefit from treating the profiles as its own configuration, and from providing methods like DetectorFactory.loadProfiles().except("fr") in the event that you don't want it to initialize French, etc. But this still doesn't solve my problem!
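The except("fr") idea above can be approximated on the caller's side by filtering the list of profile names before anything is loaded. The following is only a minimal sketch: ProfileSelection and its except method are hypothetical names invented for illustration, not part of langdetect.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of langdetect): picks which profile
// names to load by dropping the excluded languages.
public class ProfileSelection {

    public static List<String> except(List<String> available, String... excluded) {
        Set<String> skip = new HashSet<>(Arrays.asList(excluded));
        List<String> selected = new ArrayList<>();
        for (String lang : available) {
            // Keep the original order, drop anything the caller excluded.
            if (!skip.contains(lang)) {
                selected.add(lang);
            }
        }
        return selected;
    }
}
```

The returned names would then decide which profile files get read or copied before they are handed to DetectorFactory.loadProfile.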

2012-08-17 14:21
by IAmYourFaja


3

Looks like the library only accepts files. You can either change the code and try submitting the changes upstream, or write your resources to temporary files and have the library load those.
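The temp-file route could look roughly like the sketch below. It assumes the profiles sit on the classpath under a known directory (for example langdetect/profiles, as in the question's JAR layout) and that the language names are known in advance, since a classpath directory cannot be listed portably. ProfileExtractor and extractProfiles are hypothetical names, not part of langdetect.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical helper (not part of langdetect): copies profile resources
// from the classpath into a real directory on the file system.
public class ProfileExtractor {

    public static Path extractProfiles(ClassLoader loader, String resourceDir,
                                       List<String> languages) throws IOException {
        Path target = Files.createTempDirectory("langdetect-profiles");
        for (String lang : languages) {
            String resource = resourceDir + "/" + lang;
            try (InputStream in = loader.getResourceAsStream(resource)) {
                if (in == null) {
                    // Fail early instead of loading an incomplete profile set.
                    throw new IOException("Profile not found on classpath: " + resource);
                }
                Files.copy(in, target.resolve(lang), StandardCopyOption.REPLACE_EXISTING);
            }
        }
        return target;
    }
}
```

The returned directory can then be passed to the existing file-based call, e.g. DetectorFactory.loadProfile(dir.toString()). Cleaning up the temporary directory afterwards is left out of the sketch.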

2012-08-17 14:27
by vickirk
Thanks @vickirk (+1) - are you saying I have to specify every single language I want to load? (I don't think that's the case.) Can you be more specific and give a concrete code example? For instance, what do you mean by "submit the changes upstream", or "write your resource to a temp file"? What resource?! Thanks again - IAmYourFaja 2012-08-17 14:31
In fact I know this isn't the case, because my code snippet above (which uses the absolute file path to the profiles directory) works like a charm. - IAmYourFaja 2012-08-17 14:47
The library expects a String to a directory path or an actual File for the directory. It therefore can not get resources out of a jar. You would have to write the files to a directory on the file syste - vickirk 2012-08-17 14:57


5

I had the same problem. You can load the profiles from the langdetect jar using JarURLConnection and JarEntry. Note that this example uses Java 7 try-with-resources.

    String dirname = "profiles/";
    Enumeration<URL> en = Detector.class.getClassLoader().getResources(
            dirname);
    List<String> profiles = new ArrayList<>();
    if (en.hasMoreElements()) {
        URL url = en.nextElement();
        JarURLConnection urlcon = (JarURLConnection) url.openConnection();
        try (JarFile jar = urlcon.getJarFile();) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
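                JarEntry jarEntry = entries.nextElement();
                String entry = jarEntry.getName();
                // Skip the "profiles/" directory entry itself; only files are profiles.
                if (!jarEntry.isDirectory() && entry.startsWith(dirname)) {
                    try (InputStream in = Detector.class.getClassLoader()
                            .getResourceAsStream(entry);) {
                        // Read with an explicit charset (profiles assumed to be UTF-8 JSON).
                        profiles.add(IOUtils.toString(in, "UTF-8"));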
                    }
                }
            }
        }
    }

    DetectorFactory.loadProfile(profiles);
    Detector detector = DetectorFactory.create();
    detector.append(text);
    String langDetected = detector.detect();
    System.out.println(langDetected);
2013-03-11 05:51
by Mark Butler
Super! But I notice there are two profiles directories: profiles/ and profiles.sm - bertie 2013-06-05 10:52
Hi Albert, I don't know what the difference is between the two directories - you would need to ask nakatani.shuyo, who is responsible for langdetect. I don't think you can load both at the same time, because they overlap. If you need profiles.sm, just change the directory. We are mainly working with Mandarin, and the Mandarin profiles are only in profiles, not profiles.sm, hence the code above - Mark Butler 2013-06-07 01:16
I got a reply from Shuyo stating that profiles.sm contains language profiles for short messages, e.g. Twitter. I confidently assume 'sm' means 'short messages'. Hope this helps - bertie 2013-06-10 11:13
Found this here :) https://github.com/indyscala/cassandra-dem - JasonG 2014-05-19 20:27
The suggested solution is nice, but works only if the profiles directory is contained in the jar file. This is not the case if you use the jar file from the project's website - jolo 2014-08-21 17:27
Hi Jolo - I am getting the jar via Maven or SBT as described here: http://mvnrepository.com/artifact/com.cybozu.labs/langdetect/1.1-2012011 - Mark Butler 2014-08-26 14:26
com.cybozu.labs.langdetect.LangDetectException: Need more than 2 profiles
07-14 17:15:24.840 14147-14147/ W/System.err: at com.cybozu.labs.langdetect.DetectorFactory.loadProfile(DetectorFactory.java:104)
07-14 17:15:24.840 14147-14147/ W/System.err: at com.views.MainActivity$override.loadLibrary(MainActivity.java:114)
07-14 17:15:24.840 14147-14147/ W/System.err: at com.views.MainActivity$override.onCreate(MainActivity.java:38 - Deepak 2016-07-15 00:22


4

Since no Maven support was available, and the mechanism for loading profiles was not ideal (you need to specify files instead of resources), I created a fork which solves that problem:

https://github.com/galan/language-detector

I emailed the original author so he could fork/maintain the changes, but no luck; the project seems to be abandoned.

Here is an example of how to use it now (own profiles can be written where necessary):

DetectorFactory.loadProfile(new DefaultProfile()); // SmProfile is also available
Detector detector = DetectorFactory.create();
detector.append(input);
String result = detector.detect();
// maybe work with detector.getProbabilities()

I don't like the static approach that DetectorFactory uses, but I won't rewrite the full project. For that you will have to create your own fork or pull request :)

2014-06-25 07:12
by Dag
Looks good, but how does one access that using Maven? - arnt 2015-01-04 14:56
@arnt There is no repository for the project yet. You have to install it by yourself using mvn install, that's all - Dag 2015-01-04 22:19
Doesn't work for me. Method DetectorFactory.loadProfile(String) is not applicable for the arguments DefaultProfile - Berit Larsen 2015-06-09 13:49


1

Setting the working directory fixed the problem for me.

 String workingDir = System.getProperty("user.dir");
 DetectorFactory.loadProfile(workingDir+"/profiles/");
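
Note that user.dir is the directory the JVM was launched from, so this only works when a profiles directory actually sits there. A minimal sketch of a fail-fast check follows; WorkingDirProfiles and resolveProfileDir are hypothetical names, not part of langdetect.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of langdetect): resolves the profiles
// directory under a base directory and verifies that it exists.
public class WorkingDirProfiles {

    public static Path resolveProfileDir(String baseDir) {
        Path dir = Paths.get(baseDir, "profiles").toAbsolutePath().normalize();
        if (!Files.isDirectory(dir)) {
            // Clearer than letting the library fail on a missing directory.
            throw new IllegalStateException("No profiles directory at " + dir);
        }
        return dir;
    }
}
```

It could be used as DetectorFactory.loadProfile(WorkingDirProfiles.resolveProfileDir(System.getProperty("user.dir")).toString());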
2015-05-20 07:35
by Martijn Mellens