I have a list of 5 million string elements, stored as a pickle object.
a = ['https://en.wikipedia.org/wiki/Data_structure','https://en.wikipedia.org/wiki/Data_mining','https://en.wikipedia.org/wiki/Statistical_learning_theory','https://en.wikipedia.org/wiki/Machine_learning','https://en.wikipedia.org/wiki/Computer_science','https://en.wikipedia.org/wiki/Information_theory','https://en.wikipedia.org/wiki/Statistics','https://en.wikipedia.org/wiki/Mathematics','https://en.wikipedia.org/wiki/Signal_processing','https://en.wikipedia.org/wiki/Sorting_algorithm','https://en.wikipedia.org/wiki/Data_structure','https://en.wikipedia.org/wiki/Quicksort','https://en.wikipedia.org/wiki/Merge_sort','https://en.wikipedia.org/wiki/Heapsort','https://en.wikipedia.org/wiki/Insertion_sort','https://en.wikipedia.org/wiki/Introsort','https://en.wikipedia.org/wiki/Selection_sort','https://en.wikipedia.org/wiki/Timsort','https://en.wikipedia.org/wiki/Cubesort','https://en.wikipedia.org/wiki/Shellsort']
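To remove duplicates, I use set(a), and then turn the result back into a list with list(set(a)).
My question is:
Even if I restart Python and read the list from the pickle file, will list(set(a)) come out in the same order every time?
I'd like to know how this hash -> list ordering works.
I tested with a small dataset, and the ordering appears to be consistent.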
In [50]: a = ['x','y','z','k']
In [51]: a
['x', 'y', 'z', 'k']
In [52]: list(set(a))
['y', 'x', 'k', 'z']
In [53]: b=list(set(a))
In [54]: list(set(b))
['y', 'x', 'k', 'z']
In [55]: del b
In [56]: b=list(set(a))
In [57]: b
['y', 'x', 'k', 'z']
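A hedged note on the ordering question above: CPython randomizes string hashes per process by default (since Python 3.3, controlled by the PYTHONHASHSEED environment variable), so the iteration order of a set of strings is not something to rely on across interpreter restarts, even if it looks stable within one session. The following is a minimal sketch of an order-preserving alternative, assuming Python 3.7+ where dict keeps insertion order; the helper name dedupe_keep_order is illustrative only and not part of the original question.

```python
def dedupe_keep_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    # dict keys are unique and (in Python 3.7+) keep insertion order,
    # so the result does not depend on string hash values.
    return list(dict.fromkeys(items))


urls = [
    'https://en.wikipedia.org/wiki/Data_structure',
    'https://en.wikipedia.org/wiki/Data_mining',
    'https://en.wikipedia.org/wiki/Data_structure',  # duplicate entry
    'https://en.wikipedia.org/wiki/Quicksort',
]
unique_urls = dedupe_keep_order(urls)
print(unique_urls)
```

If a fixed but arbitrary order is enough, sorted(set(a)) is another option that gives the same result on every run.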
Since upgrading pandas to 0.23.0, I get an error when running a line that removes spaces: df.any_column = df.any_column.str.replace(' ','')
The error message I receive is as follows:
/usr/local/lib/python3.5/dist-packages/pandas/core/strings.py in replace(self, pat, repl, n, case, flags, regex)
2427 def replace(self, pat, repl, n=-1, case=None, flags=0, regex=True):
2428 result = str_replace(self._data, pat, repl, n=n, case=case,
-> 2429 flags=flags, regex=regex)
2430 return self._wrap_result(result)
2431
/usr/local/lib/python3.5/dist-packages/pandas/core/strings.py in str_replace(arr, pat, repl, n, case, flags, regex)
637 raise TypeError("repl must be a string or callable")
638
--> 639 is_compiled_re = is_re(pat)
640 if regex:
641 if is_compiled_re:
/usr/local/lib/python3.5/dist-packages/pandas/core/dtypes/inference.py in is_re(obj)
217 """
218
--> 219 return isinstance(obj, …
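PyTorch doesn't seem to have documentation for tensor.stride(). Can someone confirm my understanding?
My question has three parts.
Stride is used to access elements in the storage, so the length of the stride will be the same as the number of dimensions of the tensor. Correct?
For each dimension, the corresponding element of stride tells how far to move along the one-dimensional storage. Correct?
For example: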
In [15]: x = torch.arange(1,25)
In [16]: x
Out[16]:
tensor([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18
, 19, 20, 21, 22, 23, 24])
In [17]: a = x.view(4,3,2)
In [18]: a
Out[18]:
tensor([[[ 1, 2],
[ 3, 4],
[ 5, 6]],
[[ 7, 8],
[ 9, 10],
[11, 12]],
[[13, 14],
[15, 16],
[17, 18]],
[[19, 20], …
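The two points in the question can be sketched in plain Python without PyTorch. This is only an illustration under the assumption of a contiguous, row-major layout (which is what view() on a freshly created tensor yields); the helper names contiguous_strides and storage_offset are hypothetical and are not PyTorch APIs.

```python
def contiguous_strides(shape):
    """Strides (in elements) of a row-major, contiguous layout for `shape`."""
    strides = []
    step = 1
    # Walk the dimensions from last to first: the last dimension moves
    # 1 element, each earlier one moves the product of all later sizes.
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def storage_offset(index, strides):
    """Position in the flat storage of the element at a multi-dim index."""
    return sum(i * s for i, s in zip(index, strides))


storage = list(range(1, 25))             # same values as torch.arange(1, 25)
strides = contiguous_strides((4, 3, 2))  # one stride entry per dimension
print(strides)                           # (6, 2, 1)
print(storage[storage_offset((1, 2, 0), strides)])  # 11
```

Here the stride tuple has three entries because the view has three dimensions, and index (1, 2, 0) maps to storage position 1*6 + 2*2 + 0*1 = 10, which holds the value 11.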
I messed up my application and ended up doing a
git checkout <commit number>
to get back to where I wanted to be.
When I run git status, it says
Your branch is behind 'origin/master' by 5 commits, and can be fast-forwarded.
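I guess I am 5 commits behind the most recent commit. Is there a way to make that commit, the 5th most recent one, my most recent commit, basically overwriting or discarding everything in the commits that came after it?
Update.
I did a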
git reset --hard <commit number>
But how do I push to my repository? It says...
To prevent you from losing history, non-fast-forward updates were rejected
I don't want to perform a merge; I really just want to wipe out my 4 most recent commits.
Thanks
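I was recently given a programming problem in an interview.
There are 2 linked lists. Each node stores a value from 1 to 9 (representing one digit of the number). So 123 would be the linked list 1->2->3
The task is to create a function: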
static LinkedListNode getSum(LinkedListNode a, LinkedListNode b)
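which returns the sum of the values in the 2 linked lists.
If list a is: 1->2->3->4
and list b is: 5->6->7->8
the answer should be: 6->9->1->2
Here is my algorithm:
Traverse each node in a and b, get the values as integers, and add them. Create a new linked list from the result.
Here is the code. I assume it runs in roughly O(n): one pass through each input list and one pass to create the output list.
Any improvements? A better algorithm... or code improvements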
public class LinkedListNode {
    LinkedListNode next;
    int value;

    public LinkedListNode(int value) {
        this.value = value;
        this.next = null;
    }

    static int getValue(LinkedListNode node) {
        int value = node.value;
        while (node.next != null) {
            node = node.next;
            value = value * 10 + node.value;
        }
        return value;
    }

    static LinkedListNode getSum(LinkedListNode a, LinkedListNode …
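One possible improvement, sketched here in Python rather than the question's Java: getValue accumulates the whole number into an int, which in Java would overflow once a list holds more than about nine or ten digits. Adding digit by digit with a carry avoids building the full integer. The names Node, from_digits, to_digits and get_sum below are illustrative only, and the sketch assumes the most significant digit comes first, as in the 1->2->3->4 example.

```python
class Node:
    """Singly linked list node holding one decimal digit."""

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def from_digits(digits):
    """Build a linked list (most significant digit first) from a sequence."""
    head = None
    for d in reversed(digits):
        head = Node(d, head)
    return head


def to_digits(node):
    """Collect the digits of a linked list into a Python list."""
    out = []
    while node is not None:
        out.append(node.value)
        node = node.next
    return out


def get_sum(a, b):
    """Add two digit lists using a carry, never forming the full integer."""
    # The digit lists act as stacks so we can consume the
    # least significant digits first.
    da, db = to_digits(a), to_digits(b)
    carry = 0
    head = None
    while da or db or carry:
        total = carry + (da.pop() if da else 0) + (db.pop() if db else 0)
        carry, digit = divmod(total, 10)
        head = Node(digit, head)  # prepend, so the result stays MSD-first
    return head


result = get_sum(from_digits([1, 2, 3, 4]), from_digits([5, 6, 7, 8]))
print(to_digits(result))  # [6, 9, 1, 2]
```

This is still O(n) in the total number of nodes, but it also handles lists of different lengths and a final carry (for example 9->9 plus 1 gives 1->0->0).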
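I am using both Python and Java to run the Stanford NER tagger, but I see a difference in the results.
For example, when I input the sentence "Involved in all aspects of data modeling using ERwin as the primary software for this.",
Java result: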
"ERwin": "PERSON"
Python result:
In [6]: NERTagger.tag("Involved in all aspects of data modeling using ERwin as the primary software for this.".split())
Out [6]:[(u'Involved', u'O'),
(u'in', u'O'),
(u'all', u'O'),
(u'aspects', u'O'),
(u'of', u'O'),
(u'data', u'O'),
(u'modeling', u'O'),
(u'using', u'O'),
(u'ERwin', u'O'),
(u'as', u'O'),
(u'the', u'O'),
(u'primary', u'O'),
(u'software', u'O'),
(u'for', u'O'),
(u'this.', u'O')]
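The Python nltk wrapper fails to tag "ERwin" as PERSON.
What is interesting here is that Python and Java both use the same trained data (english.all.3class.caseless.distsim.crf.ser.gz), released 2015-04-20.
My ultimate goal is to make Python work the same way Java does.
I am looking at StanfordNERTagger in nltk.tag to see if there is anything I can modify. Below is the wrapper code: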
class StanfordNERTagger(StanfordTagger):
"""
A class for Named-Entity Tagging with Stanford Tagger. The input is the paths to:
- a model trained …
I want to extract all country and nationality mentions from a text using nltk. I used POS tagging to extract all the GPE-labeled tokens, but the results are not satisfactory.
abstract="Thyroid-associated orbitopathy (TO) is an autoimmune-mediated orbital inflammation that can lead to disfigurement and blindness. Multiple genetic loci have been associated with Graves' disease, but the genetic basis for TO is largely unknown. This study aimed to identify loci associated with TO in individuals with Graves' disease, using a genome-wide association scan (GWAS) for the first time to our knowledge in TO.Genome-wide association scan was performed on pooled DNA from an Australian Caucasian discovery cohort of 265 participants with …
I am searching a 4-million-row DataFrame for a substring or for multiple substrings.
df[df.col.str.contains('Donald',case=True,na=False)]
or
df[df.col.str.contains('Donald|Trump|Dump',case=True,na=False)]
The DataFrame (df) looks like the following (with 4 million string rows)
df = pd.DataFrame({'col': ["very definition of the American success story, continually setting the standards of excellence in business, real estate and entertainment.",
"The myriad vulgarities of Donald Trump—examples of which are retailed daily on Web sites and front pages these days—are not news to those of us who have",
"While a fearful nation watched the terrorists attack again, striking the cafés of Paris and the conference rooms of San Bernardino"]})
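Are there any tricks to make this string search faster? For example, sorting the DataFrame first, some way of indexing, changing column names to numbers, dropping "na=False" from the query, etc.? Even a speed-up of a few milliseconds would be very helpful!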
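One idea that may be worth timing: for a single literal substring the regex engine is not needed at all (pandas' str.contains accepts regex=False for this case), and for several alternatives a pattern compiled once can be reused. The sketch below shows both in plain Python on a list of strings, under the assumption that the column values have been pulled out of the DataFrame (for example with df.col.tolist()); the function names and the shortened sample rows are illustrative only, and any actual gain depends on the data and should be measured.

```python
import re


def contains_literal(rows, needle):
    """Plain substring test; no regular expression involved."""
    return [needle in s for s in rows]


def contains_any(rows, needles):
    """Match any of several literal substrings with one precompiled pattern."""
    # re.escape keeps characters such as '.' or '|' in the needles literal.
    pattern = re.compile("|".join(re.escape(n) for n in needles))
    return [pattern.search(s) is not None for s in rows]


rows = [
    "very definition of the American success story",
    "The myriad vulgarities of Donald Trump are not news",
    "While a fearful nation watched the terrorists attack again",
]
print(contains_literal(rows, "Donald"))
print(contains_any(rows, ["Donald", "Trump", "Dump"]))
```

The resulting list of booleans can be used as a mask in the same way as the output of str.contains.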
I compiled a cpp file on an Ubuntu machine and generated a shared library with the following commands:
g++ -c -fPIC foo.cpp -o foo.o
g++ -shared -Wl,-soname,libfoo.so -o libfoo.so foo.o
Then I loaded the libfoo.so file in Python as shown below.
from ctypes import *
lib = cdll.LoadLibrary('./lib/cppFunctions/libfoo.so')
After that I can use the functions from the cpp file in Python on Ubuntu.
However, when I try to load the .so file (libfoo.so) on a Mac, I get the following error.
OSError: dlopen(./lib/cppFunctions/libfoo.so, 6): no suitable image found. Did find:
./lib/cppFunctions/libfoo.so: unknown file type, first eight bytes: 0x7F 0x45 0x4C 0x46 0x02 0x01 0x01 0x00
Full error:
File "Resume.py", line 7, in <module>
from PhraseRecommender import *
File "/Users/aerin/Documents/bitbucket_BYOR/PhraseRecommender.py", line 4, in <module>
from SingleSentenceRecord import * #TODO: What …
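A hedged reading of the error above: the first four of the reported bytes, 0x7F 0x45 0x4C 0x46, spell the ELF magic number, i.e. the binary format used on Linux, whereas the macOS loader expects Mach-O. That suggests the library would need to be rebuilt on the Mac rather than copied from Ubuntu. The small sketch below only classifies a file header by its magic bytes; the function name binary_format is illustrative and not part of ctypes.

```python
ELF_MAGIC = b"\x7fELF"

# Mach-O magic numbers as they appear on disk: 32/64-bit in both byte
# orders, plus the "fat" (universal) binary header.
MACHO_MAGICS = {
    b"\xce\xfa\xed\xfe",
    b"\xcf\xfa\xed\xfe",
    b"\xfe\xed\xfa\xce",
    b"\xfe\xed\xfa\xcf",
    b"\xca\xfe\xba\xbe",
}


def binary_format(header):
    """Guess the binary format from the first bytes of a file."""
    magic = header[:4]
    if magic == ELF_MAGIC:
        return "ELF"
    if magic in MACHO_MAGICS:
        return "Mach-O"
    return "unknown"


# The eight bytes quoted in the dlopen error message.
header = bytes([0x7F, 0x45, 0x4C, 0x46, 0x02, 0x01, 0x01, 0x00])
print(binary_format(header))  # ELF
```

To check a real file, the header can be read with open(path, "rb").read(8) and passed to the same function.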
Apparently, hibernate and stop are two different actions I can choose. What is the difference?